Preface


In selecting the material for this book, the primary goal was to facilitate 
the training of students to become better researchers. It is our belief 
that this goal can be reached most effectively by providing the student 
with essential information in certain key areas of experimentation. 
Therefore, the readings which we have selected are intended primarily 
to provide a clearer description and understanding of some uncontrolled 
factors which can affect the outcome of an experiment independently 
of the variables deliberately manipulated. 

While it is generally conceded that the scientific method is unsur- 
passed for the acquisition of information, the naive adoption of the 
experimental model used with inanimate objects in physics has at 
times proved troublesome to the behavioral scientists. This statement 
is supported by the concern shown by behavioral scientists with the 
difficulties experienced in attempting to replicate their own findings as 
well as those of others. Despite increasing technological innovation 
and refinement of scientific tools, failures at replication occur too 
frequently. It would appear, therefore, that a more careful assessment 
of the total experiment by the behavioral investigator is necessary. 

The organization of these articles follows no rigid pattern. The 
student can begin with any block of readings he finds most appealing. 


Training students to be better researchers can also be enhanced 
by providing them with information bearing on the logical issues 
involved in current data evaluation practices. The final block of 
readings brings together some of the most fundamental and enduring 
issues related to the statistical evaluation of research findings. Several 
of the readings take issue with and question the test of significance in 
psychological research. Considerable emphasis is placed on the 
controversy surrounding null hypothesis testing. Classical and 
Bayesian approaches to data analyses are discussed, and the difference 
between theory testing in physics and in psychology is described. 
Further empirical and theoretical effort concerning these contro- 
versies is inevitable. Students who acquaint themselves with today’s 
issues in these areas will unquestionably be better prepared for future 
developments. 

We are most grateful to the many authors and publishers who 
permitted us to use their articles. 


March 1970 R. 
Research problems which are typically not emphasized in courses 
dealing with design and statistics are included in this first block of 
readings. If titles were to be used for each block, the most appropriate 




2 Research Problems in Psychology 


made repeatedly that 


setting in many varied stigators have begun to 


ntended sources of bias 
relevant data identifying 
efore, more investigators 
sional effort to a critical 


behavioral experi 


Possible negative effe 
from frequent 


Judging fr hed in the various readings of 
this first block, i i 


On the Social Psychology of the Psychological Experiment:
With Particular Reference to Demand Characteristics and Their Implications¹


Martin T. Orne²


It is to the highest degree probable that the subject[’s] . . . general
attitude of mind is that of ready complacency and cheerful willingness
to assist the investigator in every possible way by reporting to him those
very things which he is most eager to find, and that the very questions of the
experimenter . . . suggest the shade of reply expected. . . . Indeed . . . it
seems too often as if the subject were now regarded as a stupid automaton. . . .

A. H. PIERCE, 1908³


From American Psychologist, Vol. 17, 1962, pp. 776-783. Reprinted by permission
of the author and the American Psychological Association.


1. This paper was presented at the Symposium, “On the Social Psychology of
the Psychological Experiment,” American Psychological Association Convention,
New York, 1961.
The work reported here was supported in part by a Public Health Service 
Research Grant, M-3369, National Institute of Mental Health. 


2. I wish to thank my associates Ronald E. Shor, Donald N. O'Connell, 
Ulric Neisser, Karl E. Scheibe, and Emily F. Carota for their comments 
and criticisms in the preparation of this paper. 

3. See reference list (Pierce, 1908). 
Since the time of Galileo, scientists have employed the laboratory 
experiment as a method of understanding natural phenomena. 
Generically, the experimental method consists of abstracting relevant 
variables from complex situations in nature and reproducing in the 
laboratory segments of these situations, varying the parameters 
involved so as to determine the effect of the experimental variables. 
This procedure allows generalization from the information obtained 
in the laboratory situation back to the original situation as it occurs 
in nature. The physical sciences have made striking advances through 
the use of this method, but in the behavioral sciences it has often been 
difficult to meet two necessary requirements for meaningful experimentation:
reproducibility and ecological validity.⁴ It has long been
recognized that certain differences will exist between the types of 
experiments conducted in the physical sciences and those in the 
behavioral sciences because the former investigates a universe of 
inanimate objects and forces, whereas the latter deals with animate 
organisms, often thinking, conscious subjects. However, recognition 
of this distinction has not always led to appropriate changes in the 
traditional experimental model of physics as employed in the behavioral 
sciences. Rather the experimental model has been so successful as 
employed in physics that there has been a tendency in the behavioral 
sciences to follow precisely a paradigm originated for the study of 
inanimate objects, i.e., one which proceeds by exposing the subject to
various conditions and observing the differences in reaction of the
subject under different conditions. However, the use of such a model
with animal or human subjects leads to the problem that the subject
of the experiment is assumed, at least implicitly, to be a passive responder
to stimuli—an assumption difficult to justify. Further, in this type of 
model the experimental stimuli themselves are usually rigorously 
defined in terms of what is done to the subject. In contrast, the purpose 
of this paper will be to focus on what the human subject does in the 
laboratory: what motivation the subject is likely to have in the
experimental situation, how he usually perceives behavioral research,
what the nature of the cues is that the subject is likely to pick up, etc.
Stated in other terms, what factors are apt to affect the subject’s 
reaction to the well-defined stimuli in the situation? These factors 
comprise what will be referred to here as the “experimental setting.” 


4. Ecological validity, in the sense that Brunswik (1947) has used the term:
appropriate generalization from the laboratory to nonexperimental situations.
Since any experimental manipulation of human subjects takes 
place within this larger framework or setting, we should propose that 
the above-mentioned factors must be further elaborated and the 
parameters of the experimental setting more carefully defined so that 
adequate controls can be designed to isolate the effects of the experi- 
mental setting from the effects of the experimental variables. Later in 
this paper we shall propose certain possible techniques of control 
which have been devised in the process of our research on the nature of 
hypnosis. 

Our initial focus here will be on some of the qualities peculiar to 
psychological experiments. The experimental situation is one which 
takes place within the context of an explicit agreement of the subject 
to participate in a special form of social interaction known as “taking 
part in an experiment.” Within the context of our culture the roles of 
subject and experimenter are well understood and carry with them 
well-defined mutual role expectations. A particularly striking aspect 
of the typical experimenter-subject relationship is the extent to which 
the subject will play his role and place himself under the control of the 
experimenter. Once a subject has agreed to participate in a psycho- 
logical experiment, he implicitly agrees to perform a very wide range 
of actions on request without inquiring as to their purpose, and 
frequently without inquiring as to their duration. 

Furthermore, the subject agrees to tolerate a considerable degree 
of discomfort, boredom, or actual pain, if required to do so by the 
experimenter. Just about any request which could conceivably be
asked of the subject by a reputable investigator is legitimized by the
quasi-magical phrase, “This is an experiment,” and the shared assumption
that a legitimate purpose will be served by the subject’s behavior.
A somewhat trivial example of this legitimization of requests is as
follows:

A number of casual acquaintances were asked whether they
would do the experimenter a favor; on their acquiescence, they were
asked to perform five push-ups. Their response tended to be amazement,
incredulity and the question “Why?” Another similar group of
individuals were asked whether they would take part in an experiment
of brief duration. When they agreed to do so, they too were asked to
perform five push-ups. Their typical response was “Where?”


The striking degree of control inherent in the experimental situation
[...] experiments which were performed [...] experiment to test whether the
[...] relationship is greater than that in a waking relationship.⁵ In order
to test this question, we tried to develop a set of tasks which waking
subjects would refuse to do, or would do only for a short period of time.
The tasks were intended to be psychologically noxious, meaningless, or
boring, rather than painful or fatiguing.


For example, one task was to perform serial additions of each 
adjacent two numbers on sheets filled with rows of random digits. 
In order to complete just one sheet, the subject would be required to 
perform 224 additions! A stack of some 2,000 sheets was presented to 
each subject—clearly an impossible task to complete. After the 
instructions were given, the subject was deprived of his watch and told, 
“Continue to work; I will return eventually.” Five and one-half hours 
later, the experimenter gave up! In general, subjects tended to continue 
this type of task for several hours, usually with little decrement in 
performance. Since we were trying to find a task which would be
discontinued spontaneously within a brief period, we tried to create a
more frustrating situation as follows:

Subjects were asked to perform the same task described above but
were also told that when finished with the additions on each sheet, they should
pick up a card from a large pile, which would instruct them on what to
do next. However, every card in the pile read,


You are to tear up the sheet of paper which you have just completed into
a minimum of thirty-two pieces and go on to the next sheet of paper and
continue working as you did before; when you have completed this
piece of paper, pick up the next card which will instruct you further.
Work as accurately and as rapidly as you can.

Our expectation was that subjects would discontinue the task as 
soon as they realized that the cards were worded identically, that each 
finished piece of work had to be destroyed, and that, in short, the task 
was completely meaningless.

Somewhat to our amazement, subjects tended to persist in the 
task for several hours 


5. These pilot studies were performed by Thomas Menaker.
Thus far, we have been singularly unsuccessful in finding an
experimental task which would be discontinued, or, indeed, refused by
subjects in an experimental setting.⁶,⁷ Not only do subjects continue
to perform boring, unrewarding tasks, but they do so with few errors 
and little decrement in speed. It became apparent that it was extremely 
difficult to design an experiment to test the degree of social control in 
hypnosis, in view of the already very high degree of control in the 
experimental situation itself. 

The quasi-experimental work reported here is highly informal and 
based on samples of three or four subjects in each group. It does, 
however, illustrate the remarkable compliance of the experimental 
subject. The only other situations where such a wide range of requests 
are carried out with little or no question are those of complete 
authority, such as some parent-child relationships or some doctor- 
patient relationships. This aspect of the experiment as a social situation 
will not become apparent unless one tests for it; it is, however, present
in varying degrees in all experimental contexts. Not only are tasks 
carried out, but they are performed with care over considerable 
periods of time. 

Our observation that subjects tend to carry out a remarkably 
wide range of instructions with a surprising degree of diligence reflects 
only one aspect of the motivation manifested by most subjects in an
experimental situation. It is relevant to consider another aspect 
of motivation that is common to the subjects of most psycholo- 
gical experiments: high regard for the aims of science and experimen- 
tation. 

A volunteer who participates in a psychological experiment may 
do so for a wide variety of reasons ranging from the need to fulfill a 
course requirement, to the need for money, to the unvoiced hope of
altering his personal adjustment for the better, etc. Over and above 
these motives, however, college students tend to share (with the 
experimenter) the hope and expectation that the study in which they 
are participating will in some material way contribute to science and 


6. Tasks which would involve the use of actual severe physical pain or exhaus- 
tion were not considered. 

7. This observation is consistent with Frank’s (1944) failure to obtain resistance 
to disagreeable or nonsensical tasks. He accounts for this “primarily by S’s 
unwillingness to break the tacit agreement he had made when he volunteered 
to take part in the experiment, namely, to do whatever the experiment 
required of him” (p. 24). 
perhaps ultimately to human welfare in general. We should expect 
that many of the characteristics of the experimental situation derive 
from the peculiar role relationship which exists between subject and 
experimenter. Both subject and experimenter share the belief that 
whatever the experimental task is, it is important, and that as such no 
matter how much effort must be exerted or how much discomfort 
must be endured, it is justified by the ultimate purpose. 

If we assume that much of the motivation of the subject to comply 
with any and all experimental instructions derives from an identification 
with the goals of science in general and the success of the experiment 
in particular,⁸ it follows that the subject has a stake in the outcome of
the study in which he is participating. For the volunteer subject to 
feel that he has made a useful contribution, it is necessary for him to 
assume that the experimenter is competent and that he himself is a 
“good subject.” 

The significance to the subject of successfully being a “good subject” 
is attested to by the frequent questions at the conclusion of an experi- 
ment, to the effect of, “Did I ruin the experiment?” What is most 
commonly meant by this is, “Did I perform well in my role as experi- 
mental subject?” or “Did my behavior demonstrate that which the 
experiment is designed to show?” Admittedly, subjects are concerned 
about their performance in terms of reinforcing their self-image; 
nonetheless, they seem even more concerned with the utility of their 
performances. We might well expect then that as far as the subject is 
able, he will behave in an experimental context in a manner designed 
to play the role of a “good subject” or, in other words, to validate the 
experimental hypothesis. Viewed in this way, the student volunteer is
not merely a passive responder in an experimental situation but rather
he has a very real stake in the successful outcome of the experiment. 
This problem is implicitly recognized in the large number of psycho- 
logical studies which attempt to conceal the true purpose of the experi- 
ment from the subject in the hope of thereby obtaining more reliable 
data. This maneuver on the part of psychologists is so widely known in 
the college population that even if a psychologist is honest with the 
subject, more often than not he will be distrusted. As one subject
pithily put it, “Psychologists always lie!” This bit of paranoia has
some support in reality.


8. This hypothesis is subject to empirical test. We should predict that there 
would be measurable differences in motivation between subjects who 
perceive a particular experiment as “significant” and those who perceive 
the experiment as “unimportant.” 
The subject’s performance in an experiment might almost be 
conceptualized as problem-solving behavior; that is, at some level 
he sees it as his task to ascertain the true purpose of the experiment and 
respond in a manner which will support the hypotheses being tested. 
Viewed in this light, the totality of cues which convey an experimental
hypothesis to the subject become significant determinants of subjects’
behavior. We have labeled the sum total of such cues as the “demand 
characteristics of the experimental situation” (Orne, 1959a). These cues 
include the rumors or campus scuttlebutt about the research, the 
information conveyed during the original solicitation, the person of 
the experimenter, and the setting of the laboratory, as well as all 
explicit and implicit communications during the experiment proper. 
A frequently overlooked, but nonetheless very significant source of 
cues for the subject lies in the experimental procedure itself, viewed in
the light of the subject’s previous knowledge and experience. For 
example, if a test is given twice with some intervening treatment, 
even the dullest college student is aware that some change is expected, 
particularly if the test is in some obvious way related to the treatment. 

The demand characteristics perceived in any particular experiment
will vary with the sophistication, intelligence, and previous experience
of each experimental subject. To the extent that the demand characteristics
of the experiment are clear-cut, they will be perceived uniformly
by most experimental subjects. It is entirely possible to have an
experimental situation with clear-cut demand characteristics for
psychology undergraduates which, however, does not have the same
clear-cut demand characteristics for enlisted army personnel. It is,
of course, those demand characteristics which are perceived by the
subject that will influence his behavior.

We should like to propose the heuristic assumption that a subject’s
behavior in any experimental situation will be determined by two sets
of variables: (a) those which are traditionally defined as experimental
variables and (b) the perceived demand characteristics of the experimental
situation. The extent to which the subject’s behavior is related
to the demand characteristics, rather than to the experimental variable,
will in large measure determine both the extent to which the experiment
can be replicated with minor modification (i.e., modified demand
characteristics) and the extent to which generalizations can be drawn
about the effect of the experimental variables in nonexperimental
contexts [the problem of ecological validity (Brunswik, 1947)].

It becomes an empirical issue to study under what circumstances, in
what kind of experimental contexts, and with what kind of subject
populations, demand characteristics become significant in determining
the behavior of subjects in experimental situations. It should be clear
that demand characteristics cannot be eliminated from experiments; 
all experiments will have demand characteristics, and these will always 
have some effect. It does become possible, however, to study the effect 
of demand characteristics as opposed to the effect of experimental 
variables. However, techniques designed to study the effect of demand 
characteristics need to take into account that these effects result 
from the subject’s active attempt to respond appropriately to the 
totality of the experimental situation.
It is perhaps best to think of the perceived demand characteristics 
as a contextual variable in the experimental situation. We should like
to emphasize that, at this stage, little is known about this variable. 
In our first study which utilized the demand characteristics concept 
(Orne, 1959b), we found that a particular experimental effect was 
present only in records of those subjects who were able to verbalize the 
experimenter’s hypothesis. Those subjects who were unable to do so 
did not show the predicted phenomenon. Indeed we found that 
whether or not a given subject perceived the experimenter’s hypothesis 
was a more accurate predictor of the subject’s actual performance 
than his statement about what he thought he had done on the experi- 
mental task. It became clear from extensive interviews with subjects 
that response to the demand characteristics is not merely conscious 
compliance. When we speak of “playing the role of a good experimental
subject,” we use the concept analogously to the way in which Sarbin 
(1950) describes role playing in hypnosis: namely, largely on a 
opposite direction. (This is analogous to the often observed tendency
to favor individuals whom we dislike in an effort to be fair.)⁹

Delineation of the situations where demand characteristics may 
produce an effect ascribed to experimental variables, or where they 
may obscure such an effect and actually lead to systematic data in the 
opposite direction, as well as those experimental contexts where they 
do not play a major role, is an issue for further work. Recognizing the 
contribution to experimental results which may be made by the demand 
characteristics of the situation, what are some experimental techniques 
for the study of demand characteristics? 

As we have pointed out, it is futile to imagine an experiment that 
could be created without demand characteristics. One of the basic 
characteristics of the human being is that he will ascribe purpose and 
meaning even in the absence of purpose and meaning. In an experiment 
where he knows some purpose exists, it is inconceivable for him not to 
form some hypothesis as to the purpose, based on some cues, no 
matter how meager; this will then determine the demand characteristics 
which will be perceived by and operate for a particular subject. 
Rather than eliminating this variable then, it becomes necessary to take 
demand characteristics into account, study their effect, and manipulate 
them if necessary.

One procedure to determine the demand characteristics is the
systematic study of each individual subject's perception of the experi- 
mental hypothesis. If one can determine what demand characteristics 
are perceived by each subject, it becomes possible to determine to 
what extent these, rather than the experimental variables, correlate 
with the observed behavior. If the subject’s behavior correlates 
better with the demand characteristics than with the experimental 
variables, it is probable that the demand characteristics are the major 
determinants of the behavior. 
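
The comparison described above can be sketched numerically. The following is a minimal illustration, not a procedure taken from the paper: the per-subject records, the variable names (`treated`, `perceived`, `score`), and the choice of the Pearson coefficient are all assumptions made for the example, and the numbers are invented.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical records, one entry per subject (invented for illustration):
#   treated   : 1 if the subject was exposed to the experimental variable
#   perceived : 1 if the inquiry showed the subject had grasped the hypothesis
#   score     : the observed behavior on the experimental task
treated   = [1, 1, 1, 1, 0, 0, 0, 0]
perceived = [1, 1, 0, 0, 1, 1, 0, 0]
score     = [9, 8, 3, 4, 8, 9, 2, 3]

r_treatment = pearson_r(treated, score)
r_demand = pearson_r(perceived, score)

print(f"r(experimental variable, behavior) = {r_treatment:.2f}")
print(f"r(perceived hypothesis, behavior)  = {r_demand:.2f}")
```

In this invented data set the behavior tracks the perceived hypothesis far more closely than it tracks the treatment, which is the pattern the text treats as evidence that demand characteristics are the major determinants. A real analysis would need a defensible coding of what each subject perceived.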

The most obvious technique for determining what demand 
characteristics are perceived is the use of postexperimental inquiry. 
In this regard, it is well to point out that considerable self-discipline is 
necessary for the experimenter to obtain a valid inquiry. A great 
many experimenters at least implicitly make the demand that the


9. Rosenthal (1961) in his recent work on experimenter bias, has reported a 
similar type of phenomenon. Biasing was maximized by ego involvement 
of the experimenters, but when an attempt was made to increase biasing 
by paying for “good results,” there was a marked reduction of effect. This 
reversal may be ascribed to the experimenters’ becoming too aware of their 
own wishes in the situation. 
subject not perceive what is really going on. The temptation for the
experimenter, in, say, a replication of an Asch group pressure experiment,
is to ask the subject afterwards, “You didn’t realize that the other
fellows were confederates, did you?” Having obtained the required
“No,” the experimenter breathes a sigh of relief and neither subject nor
experimenter pursues the issue further.¹⁰ However, even if the experimenter
makes an effort to elicit the subject’s perception of the
hypothesis of the experiment, he may have difficulty in obtaining a valid
report because the subject as well as he himself has considerable
interest in appearing naive.


Most subjects are co 
ningful. For this reason, it is

rance resulting from the inter-
Create a situation where the particular subject’s performance needs to be

10. Asch (1952) himself took great pains to avoid this pitfall.
A procedure which has been independently advocated by Riecken 
(1958) and Orne (1959a) is designed to deal with the first of these 
objections. This consists of an inquiry procedure which is conducted 
much as though the subject had actually been run in the experiment, 
without, however, permitting him to be given any experimental data. 
Instead, the precise procedure of the experiment is explained, the 
experimental material is shown to the subject, and he is told what he 
would be required to do; however, he is not permitted to make any 
responses. He is then given a postexperimental inquiry as though he 
had been a subject. Thus, one would say, “If I had asked you to do all 
these things, what do you think that the experiment would be about, 
what do you think I would be trying to prove, what would my hypothesis 
be?” etc. This technique, which we have termed the pre-experimental 
inquiry, can be extended very readily to the giving of pre-experimental 
tests, followed by the explanation of experimental conditions and 
tasks, and the administration of postexperimental tests. The subject is 
requested to behave on these tests as though he had been exposed 
to the experimental treatment that was described to him. This type of 
procedure is not open to the objection that the subject’s own behavior
has provided cues for him as to the purpose of the task. It presents him 
with a straight problem-solving situation and makes explicit what, 
for the true experimental subject, is implicit. It goes without saying that 
these subjects who are run on the pre-experimental inquiry conditions 
must be drawn from the same population as the experimental groups 
and may, of course, not be run subsequently in the experimental 
condition. This technique is one of approximation rather than of 
proof. However, if subjects describe behavior on the pre-inquiry
conditions as similar to, or identical with, that actually given by subjects 
exposed to the experimental conditions, the hypothesis becomes 
plausible that demand characteristics may be responsible for the 
behavior. 

It is clear that pre- and postexperimental inquiry techniques have 
their own demand characteristics. For these reasons, it is usually best to 
have the inquiry conducted by an experimenter who is not acquainted 
with the actual experimental behavior of the subjects. This will tend to 
minimize the effect of experimenter bias.

Another technique which we have utilized for approximating 
the effect of the demand characteristics is to attempt to hold the demand 
characteristics constant and eliminate the experimental variable. One 
way of accomplishing this purpose is through the use of simulating 
subjects. This is a group of subjects who are not exposed to the
experimental variable to which the effect has been attributed, but who 
are instructed to act as if this were the case. In order to control for 
experimenter bias under these circumstances, it is advisable to utilize 
more than one experimenter and to have the experimenter who actually 
runs the subjects “blind” as to which group (simulating or real) any 
given individual belongs. 

Our work in hypnosis (Damaser, Shor, and Orne, 1963; Orne, 
1959b; Shor, 1959) is a good example of the use of simulating controls. 
Subjects unable to enter hypnosis are instructed to simulate entering 
hypnosis for another experimenter. The experimenter who runs the 
study sees both highly trained hypnotic subjects and simulators in 
random order and does not know to which group each subject belongs. 
Because the subjects are run “blind,” the experimenter is more likely
to treat the two groups of subjects identically. We have found that
simulating subjects are able to [...] deceiving even well [...]
is not exposed to the experimental [...] problem-solving task: namely, to utilize
[...] by the experimental context and the [...] behavior in order to
behave as they think that [...]. Therefore, to the extent that simulating
subjects are able to behave identically, it is possible that demand
characteristics, rather than the altered state [...]
patient is not, despite precautions to the contrary; i.e., the patient is
cognizant that he does not have the side effects which some of his
fellow patients on the ward experience. By the same token, in psy- 
chological placebo treatments, it is equally important to ascertain 
whether the subject actually perceived the treatment to be experimental 
or control. Certainly the subject’s perception of himself as a control 
subject may materially alter the situation. 

A recent experiment (Orne and Scheibe, 1964) in our laboratory 
illustrates this type of investigation. We were interested in studying the 
demand characteristics of sensory deprivation experiments, indepen- 
dent of any actual sensory deprivation. We hypothesized that the 
overly cautious treatment of subjects, careful screening for mental or
physical disorders, awesome release forms, and, above all, the presence 
of a “panic (release) button” might be more significant in producing the 
effects reported from sensory deprivation than the actual diminution of 
sensory input. A pilot study (Stare, Brown, and Orne, 1959), employing 
pre-inquiry techniques, supported this view. Recently, we designed an 
experiment to test more rigorously this hypothesis. 

This experiment, which we called Meaning Deprivation, had all 
the accoutrements of sensory deprivation, including release forms and a 
red panic button. However, we carefully refrained from creating any 
sensory deprivation whatsoever. The experimental task consisted of
sitting in a small experimental room which was well lighted, with two
comfortable chairs, as well as ice water and a sandwich, and an optional 
task of adding numbers. The subject did not have a watch during this 
time, the room was reasonably quiet, but not soundproof, and the 
duration of the experiment (of which the subject was ignorant) was 
four hours. Before the subject was placed in the experimental room, 
10 tests previously used in sensory deprivation research were adminis- 
tered. At the completion of the experiment, the same tasks were again 
administered. A microphone and a one-way screen were present in 
the room, and the subject was encouraged to verbalize freely. 

The control group of 10 subjects was subjected to the identical 
treatment, except that they were told that they were control subjects 
for a sensory deprivation experiment. The panic button was eliminated 
for this group. The formal experimental treatment of these two groups 
of subjects was the same in terms of the objective stress—four hours 
of isolation. However, the demand characteristics had been purposively 
varied for the two groups to study the effect of demand characteristics 
as opposed to objective stress. Of the 14 measures which could be 
quantified, 13 were in the predicted direction, and 6 were significant 
at the selected 10% alpha level or better. A Mann-Whitney U test has 
been performed on the summation ranks of all measures as a convenient 
method for summarizing the overall differences. The one-tailed 
probability which emerges is p = .001, a clear demonstration of expected 
effects. 
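
The summary analysis described above (rank every subject on each measure, sum each subject's ranks, then compare the two groups' rank sums with a Mann-Whitney U test) can be sketched as follows. This is only an illustration under stated assumptions: the scores are invented, the function names are hypothetical, and the one-tailed probability uses the normal approximation without a tie correction, which is crude for groups this small. It is not a reconstruction of the original data or computation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def summed_ranks(scores):
    """scores[s][m] is subject s's score on measure m; return each subject's rank sum."""
    totals = [0.0] * len(scores)
    for m in range(len(scores[0])):
        column_ranks = average_ranks([row[m] for row in scores])
        for s, rank in enumerate(column_ranks):
            totals[s] += rank
    return totals


def mann_whitney_u(x, y):
    """Return (U for x, one-tailed p that x tends to exceed y).

    The p value uses the normal approximation with no tie correction.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return u1, 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical scores: three subjects per group, two measures each.
experimental = [[5, 6], [7, 8], [9, 9]]
control = [[1, 2], [3, 4], [2, 1]]
totals = summed_ranks(experimental + control)
u, p = mann_whitney_u(totals[:3], totals[3:])
print(u, p)
```

Summing ranks before testing yields a single overall comparison instead of one test per measure, which is the convenience the text refers to.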

This study suggests that demand characteristics may in part 
account for some of the findings commonly attributed to sensory 
deprivation. We have found similar significant effects of demand 
characteristics in accounting for a great deal of the findings reported in 
hypnosis. It is highly probable that careful attention to this variable, 
or group of variables, may resolve some of the current controversies 
regarding a number of psychological phenomena in motivation, 
learning, and perception. 

In summary, we have suggested that the subject must be recognized 
as an active participant in any experiment, and that it may be fruitful 
to view the psychological experiment as a very special form of social 
interaction. We have proposed that the subject’s behavior in an experiment 
is a function of the totality of the situation, which includes the 


experimental variables being investigated and at least one other set 
of variables which we hav. 


characteristics of the experi 


demand characteristics are not simply matters of good experimental 
technique; rather, it is an 


empirical issue to determine under what 
circumstances demand ch 
experimental behavior. 
experiments may in some cases permit the generalization of research 
findings at least to psychology students enrolled in certain courses. 
In many cases, however, even this generalization may be unwarranted. 
Frequently the psychology student, while required to serve as S in 
psychological research, has a choice of which experiment to participate 
in. Do brighter (or duller) students sign up for learning experiments or 
at least for experiments that are labelled ‘learning’? Do better (or more 
poorly) adjusted students sign up for experiments labelled as per- 
sonality experiments? Do better (or more poorly) coordinated 
students sign up for motor skills studies? The answers to these types 
of question and, more importantly, whether they make a difference, 
are also empirical matters. 

Psychologists have concerned themselves a good deal with the 
problem of the volunteer S. Evidence for this concern will be found in 
the following pages, where we will find a fair number of attempts to 
learn something of the act of volunteering and of the differences between 
volunteers and nonvolunteers. Further evidence for this concern can 
be found, too, in the frequent statements made with pride in the 
psychological literature of recent vintage that ‘the subjects employed 
in this experiment were nonvolunteers’. The discipline of mathematical 
statistics, that good consultant to the discipline of psychology, has 
concerned itself with the volunteer problem (e.g., Cochran, Mosteller, 
and Tukey, 1953). Evidence for this concern can be found in the fact 
that we now know a good bit about the implications for statistical 
procedures and inference of having drawn a sample of volunteers 
(Bell, 1961). Generally, the concern over the volunteer problem has had 
as its goal the reduction of the non-representativeness of volunteer 
samples in order to increase the generality of research findings (e.g., 
Locke, 1954; Hyman and Sheatsley, 1954). The magnitude of the 
potential biasing effect of volunteer samples is clearly illustrated in a 
report² that, at a large university, rates of volunteering varied from 
10 to 100 percent. Within the same course, different recruiters going 
to different sections of the course obtained volunteering rates anywhere 
from 50 to 100 percent. 

Our special purposes here will be first to organize and conceptualize 
whatever may be substantively known about the act of volunteering 
and the more enduring personal attributes of volunteers compared 
with nonvolunteers. Subsequently we shall examine the implications 
of our analysis for the representativeness of research findings and for the 
possible effects on experimental outcomes. 


2. John R. P. French, Jr., personal communication, 19 August 1963. 
The act of volunteering 

of which might be enough to prevent someone from volunteering for 
either surveys or experiments... The data-provider, on the other hand, 
can, and often does, negatively evaluate the data-collector. He can 
call him, his task, or his questionnaire inept, stupid, banal, and the 
like, but hardly with any great feeling of confidence that this evaluation 
is really accurate; the data-collector, after all, has a plan for the use 
of his data, and the subject or respondent usually does not know this 
plan, though he is aware that a plan exists. He is, therefore, in a poor 
position to evaluate the data-collector’s performance, and he is likely 
to know it. 

Riecken (1962) has postulated that one of an experimental S's 
major aims in the experimental interaction is to ‘put his best foot 
forward’. It follows from this and from what we have said earlier that, 
in both surveys and experiments, prospective respondents or S's 
are more likely to volunteer or respond when there is an increase in 
their subjective probability of being evaluated more favorably. And so 
it seems to be—a finding based primarily upon studies of nonrespon- 
ders in survey research studies. Edgerton, Britt, and Norman (1947) 
found that winners of contests responded most helpfully to follow-up 
questionnaires, whereas losers responded least. Their interpretation of 
greater interest in the subject on the part of the winner group does not 
by itself seem entirely convincing. Contest winners have the secure 
assurance that they did win, and they are being asked to respond, very 
likely, in their winner role. Contest losers, on the other hand, have the 
less happy assurance that they were losers and perhaps would be 
further evaluated as such. These same authors (1947) convincingly 
demonstrated the consistency of their results by summarizing work 
which showed, for example, that: parents of delinquent boys were 
more likely to answer questionnaires about them if they had nicer 
things to say about them; college professors holding minor and tem- 
porary appointments were less likely to reply usefully to questionnaires; 
teachers who had no radios replied less promptly to questionnaires 
about the use of radios in the classroom than did teachers who had 
radios; patrons of commercial airlines more promptly returned 
questionnaires about airline usage than did nonpatrons; college 
graduates replied more often and more promptly to college follow-up 
questionnaires than did drop-outs. Norman (1948) cited evidence that 
technical and science graduates replied less promptly to questionnaires 
if they were unemployed, or employed outside the field in which 
they had been trained. Locke (1954) found married respondents 
more willing to be interviewed about marital adjustment than divorced 
respondents. None of these findings argues against the interest 
hypothesis advanced by Edgerton, Britt, and Norman (1947), and, 
evaluated by the investigator. It would seem trite but necessary to add 
that this formulation requires 


Characteristics of volunteers 


Sex. In two survey research projects and in a social psychological 
experiment, Belson (1960), Wallin (1949), and Schachter and Hall 
(1952) all found no difference between men and women in their 
wi 
and the appearing student as volunteerlike in finding his way into an 
experiment, these workers confirmed Frey and Becker’s finding of 
no sex difference associated with volunteering or nonvolunteering. A 
sad footnote to these results is provided by the fact that about half 
of Frey and Becker’s no-shows notified their E’s of their forthcoming 
absence whereas only one of Leipold and James's 39 no-shows so 
demeaned himself. 

London (1961), too, found almost identical rates of simple willing- 
ness to participate in a psychological experiment among men and 
women, this time in an experiment involving hypnosis. He did find, 
however, that among those who said they were ‘very eager’ to participate 
there were many more men than women. More men than women, too, 
are willing to volunteer for electric shocks (Howe, 1960), and for 
Kinsey-type interviews dealing with sex attitudes or behavior (Siegman, 
1956; Martin and Marcuse, 1958). London interpreted his finding for 
the hypnosis research situation as a reflection of the girls’ greater 
fear of loss of control. A more parsimonious interpretation, which 
takes London’s finding into account as well as the findings of Howe, 
Siegman, and Martin and Marcuse, may be that being ‘very eager’ 
to be hypnotized and being willing to be Kinsey-interviewed or 
electrically shocked are indications of a somewhat generalized un- 
conventionality associated culturally with males for certain types of 
situation. It has been shown, and will be discussed in detail later, that 
more unconventional students do, in fact, volunteer more for psychological 
research, defining unconventionality in a variety of ways. 
For more run-of-the-mine type experiments, this sex-linked unconven- 
tionality would not be so relevant to volunteering, an interpretation 
which is not too inconsistent with findings by Himelstein (1956) and 
Schubert (1960). Both these workers found that, for experiments 
unspecified for their S’s, females volunteered significantly more than 
males. 

The likelihood of sex by experimental-situation interaction effects 
is maintained by a further finding of Martin and Marcuse (1958). 
To requests for volunteers for experiments in learning, personality, and 
hypnosis, girls tend to respond more in each case, although none of the 
differences could be judged statistically significant. It should be added 
that these authors did not ask their potential hypnosis S's whether 
they were ‘very eager’. 


Finally, we need to remind ourselves of Coffin’s (1941) caution that 
any obtained sex differences in behavioral research may be a function 
of the sex of E. Thus we may wonder, along with Coffin and Martin 
and Marcuse, about (i) the differential effects on volunteer rates 
among male and female S’s of being confronted with a male vs. a 
female Kinsey interviewer; and (ii) the differential effects on eagerness 
to be hypnotized of being confronted with a male vs. a female hypnotist. 

Related to the sex effect on differential volunteering rates are the 
findings of Rosen (1951) and Schubert (1960). Both workers found males 
who showed greater femininity of interests to be more likely to volunteer 
for psychological experiments. 
rather to be determined nonvolunteers. Interpretation of these 
findings is rendered difficult, however, by (i) the fact that theirs was 
not meant to be a direct study of volunteer characteristics, and (ii) 
the extremeness of all their subjects’ scores on a sociability-relevant 
variable: introversion-extraversion. 


Anxiety. Our understanding of the relationship between Ss’ anxiety 
and the volunteering response suffers not so much from lack of data 
as from lack of consistency. Some studies reveal volunteers to be 
more anxious than nonvolunteers, some suggest that they are less 
anxious, and still others find no differences in this respect. 

There were no apparent systematic differences in the types of 
experiment for which participation was requested between those 
workers who found volunteers more anxious and those who found 
them less anxious. 

Scheier (1959), utilizing the IPAT questionnaires, found volun- 
teers to be less anxious than nonvolunteers; and Himelstein (1956), 
employing the TMAS, found a trend in the same direction, though 
his differences were not judged statistically significant. Heilizer (1960), 
Howe (1960), and Siegman (1956) found no statistically significant 
differences in the levels of TMAS anxiety of volunteer and nonvolunteer 
S's. Both Rosen (1951) and Schubert (1960), however, found volunteers 
scoring higher on anxiety as defined by the MMPI-Pt scale. It might be 
tempting to summarize these findings by saying that differences in 
anxiety level between volunteers and nonvolunteers cancel out to no 
difference at all. Though this might simplify our interpretation, it 
would be a little like taking two relatively unlikely events on opposite 
ends of a continuum (say feast and famine or drought and flood) and 
concluding that two usual events had occurred. Two other studies are 
relevant here and may provide a clue to the interpretation of the 
inconsistent findings reported. 

Leipold and James (1962) compared those S's who failed to appear 
for an experiment with those who a on i e 
(TMAS) separately for each sex, after finding a significant interaction 
had been requested. Their volunteers were higher in anxiety than their 


26 Research Problems in Psychology 


nonvolunteers for an experiment in personality. No differences 
(that held for both male and female subjects) were found between 
volunteers and nonvolunteers for experiments in hypnosis, learning, 
or attitudes about sex. However, males volunteering for a hypnosis 
experiment were less anxious than male nonvolunteers. This difference 
was not found for female S's. The Martin and Marcuse data, while 
providing further contradictory evidence, suggest that future studies 
of anxiety may resolve these contradictory findings by considering 
the likelihood of significant interaction effects of sex of S and type of 
experiment for which volunteers are solicited upon the differences in 
anxiety level obtained between volunteers and nonvolunteers. 
Personal Preference Schedule. However, Frye and Adams (1959), 
employing the same instrument, found no personality difference 
between (male and female) volunteers and nonvolunteers for an experi- 
ment in social psychology. At least for the variety of measures of 
‘conformity’ employed, no general conclusions about their relationship 
to the volunteering response seem warranted. 


Age. Participants in survey research studies tend to be younger than 
nonparticipants, as shown by Wallin (1949). In addition, earlier 
compliers with a request to complete a questionnaire tend to be 
younger than later compliers (Abeles, Iscoe, and Brown, 1954-55). 
Volunteers for personality research were found to be younger than 
nonvolunteers by Newman (1957), who also demonstrated that, at 
least among females, variability of age of volunteers is a function of the 
type of experiment for which participation is solicited. Rosen (1951) 
found younger females to volunteer more than older females, but no 
such relationship held for male S's. We may have somewhat greater 
confidence in summarizing the relationship of age to volunteering 
than is warranted for some of the other variables discussed in this 
section. Volunteers tend to be younger than nonvolunteers, especially 
among female S's. 


Intelligence. Here we shall include not only intelligence-test differences 
between volunteers and nonvolunteers, but comparisons on the related 
variables of motor skill, grades, education, and serious-mindedness as 
well. Martin and Marcuse (1957, 1958) reported that their volunteers 
earned higher scores on a standard test of intelligence (ACE), a finding 
supported by Edgerton, Britt, and Norman (1947) in their review of 
several survey research studies. Brower’s (1948) data showed volunteers 
to perform a difficult motor task with greater speed and fewer errors 
than a group of S’s who were forced to participate. For simpler motor 
tasks, performance differences were less clear cut. Leipold and James 
(1962) found that those S’s who showed up for an experiment to which 
they had been assigned were earning somewhat higher grades in 
introductory psychology than those who did not show up. The trend 
they found was greater for female than for male S's. Similarly, Abeles, 
Iscoe, and Brown (1954-55) reported higher grades earned by those 
who complied more promptly with a request to participate in a 
questionnaire study. Rosen (1951), on the other hand, did not find a difference 
in respect of grades earned between volunteers and nonvolunteers for 
psychological experiments. He did find, however, that female volunteers 
were more serious-minded than female nonvolunteers. If we 
may consider not belonging to a fraternity as a mark of serious- 
experiments to be more unconventional by this definition. These 
same workers further found volunteers to score higher on the F scale of 
the MMPI, which reflects a willingness to admit to unconventional 
experiences. The Lie scale of the MMPI taps primness and propriety, 
and high scorers may be regarded as more conventional than low 
scorers. Although Heilizer (1960) found no Lie-scale differences 
between volunteers and nonvolunteers, Schubert (1960) found volun- 
teers to score lower. 

It seems to be generally true that volunteers for a variety of 
psychological studies tend to be more unconventional than their 
nonvolunteering counterparts. Of a dozen relevant bits of evidence, 
ten are consistent with this formulation and only two are not. These 
two inconsistent bits of evidence seem less weakening of our conclusion 
by virtue of the fact that they find no difference between volunteers 
and nonvolunteers in this respect, rather than differences in the opposite 
direction. In general, these findings have been found to occur with both 
male and female S’s. There is, however, some evidence which should 
caution us to look for possible interaction effects of sex of S with the 
relationship of conventionality to volunteering. London et al. (1961) 
concluded that, at least for hypnosis experiments, girls who volunteer 
may be significantly more interested in the novel and the unusual, 
whereas for boys this relationship seems less likely. Under the heading 
of conformity we have already mentioned a finding that may bear out 
London et al. This was Foster’s (1961) finding that the relationship 
between conformity and volunteering was in the opposite direction for 
boys as compared with girls. Although his finding did not reach 
statistical significance, it has theoretical significance for us here when 
viewed in the light of the results of London et al. 
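
The tally given above, ten of a dozen bits of evidence consistent with the unconventionality formulation, can be given a rough numerical footing with a one-tailed sign test. The sketch below is an illustration only and not an analysis from the source: the function name is hypothetical, it assumes the twelve findings are independent (they may well not be), and it counts the two no-difference findings against the formulation, which is the more conservative choice.

```python
from math import comb


def sign_test_p(consistent, total):
    """One-tailed probability of at least `consistent` successes in `total`
    independent trials when each direction is equally likely (p = 0.5)."""
    tail = sum(comb(total, k) for k in range(consistent, total + 1))
    return tail / 2 ** total


p_unconventionality = sign_test_p(10, 12)
print(f"10 of 12 consistent: p = {p_unconventionality:.4f}")  # prints p = 0.0193
```

A count of this kind ignores sample sizes and effect magnitudes, so it supports only the direction of the relationship, not its strength.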


Arousal-seeking. Schubert (1960) postulated that volunteering is a 
function of a trait he called arousal-seeking. Evidence for this relationship 
came from the fact that his volunteers for a ‘psychological 
experiment’ reported greater coffee-drinking and caffeine pill-taking 
in comparison with nonvolunteers, as well as from differences in scores 
between volunteers and nonvolunteers on various scales of the 
MMPI which might be considered consistent with his hypothesis. 
Those MMPI characteristics he found associated with greater 
volunteering were generally also found by London et al. (1961), with 
one important exception. The Hypomanic (Ma) scale of the MMPI 
was found by Schubert to correlate positively with volunteering. Since 
the implied hyperactivity of high Ma scorers is consistent with the 
trait of arousal-seeking, this finding strengthened Schubert's hypothesis. 
London et al., however, found a negative relationship between the 
London et al. (1961) concluded that those S’s who volunteered 
for hypnosis experiments in order to serve science (rather than for 
novelty) were a psychologically more stable or ‘upright’ group as 
defined by 16 Pf. Rosen (1951) found his volunteers to be more psycho- 
logically-minded (e.g., F-scale scores) and to admit more readily to 
feelings of anxiety and inadequacy (MMPI). It is difficult to decide 
whether we should therefore consider volunteers better or worse 
adjusted than nonvolunteers. In clinical lore these feelings are osten- 
sibly ‘bad’ to have, but, on the other hand, knowing you have them 
gives you an adjustive edge. 

In the area of medical research, Richards (1960) found projective 
test differences between those who volunteered to take Mescaline and 
those who did not. The investigator could not determine, however, 
which group was the better adjusted. In a sample drawn by Lasagna 
and von Felsinger (1954), volunteers for medical research seemed to be 
relatively poorly adjusted. For the area of medical research, finally, 
Pollin and Perlin (1958) and Perlin, Pollin, and Butler (1958) concluded 
that the greater the intrinsic motivation of a subject to volunteer for the 
role of normal control, the greater the likelihood of psychopathology. 
This does seem to be the clearest evidence of a relationship between 
volunteering and psychopathology we have yet encountered. Two 
circumstances should be taken into account, however, in evaluating 
this finding. First is the notorious unreliability of psychiatric diagnosis, 
and second is the fact that the medical research studies included 
in our discussion may differ qualitatively from the more usual psychological 
studies we have been considering. 

On the basis of the evidence presented we may propose that the 
nature of the relationship between volunteering and adjustment, while 
essentially unknown, may be a function of the task for which volunteers 
are solicited, and may be differentially curvilinear as a function of 
sex of S. 

Sociological Variables. Insufficient data are available to warrant much 
discussion of such variables as social and economic status, religious 
affiliation, marital status, and regional factors in the determination of 
volunteering behavior. Although Belson (1960) reported higher social 
class S’s to volunteer more, Rosen (1951) found the opposite relationship 
among his female college S’s. Rosen found few other sociological 
variables to make much difference. Wallin’s (1949) data are in partial 
agreement except that for his survey situation he did find religious 
affiliation a relevant predictor variable. For a Kinsey-type interview, 
Siegman (1956) reported higher volunteering rates in an eastern 
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; » Personality, percept; nd pain. 
Unfortunately very few studies have employed more than one task for 
which to solicit volunteers, 


ting and the persona 
57) did employ more than 
one task in his study. His S’s were asked to volunteer for both a 
personality and a perception experiment, but he found no systematic 
effect of these two tasks on the relationships between the variables he 
investigated and the act of volunteering. Martin and Marcuse (1958) 
employed four tasks for which volunteering was requested. They 
found greater differences between volunteers and nonvolunteers for 
their hypnosis experiment than were found between the two groups 
for experiments in learning, attitudes to sex, and personality. Of these 
last three experimental situations, the personality study situation tended 
to reveal somewhat more personality differences between volunteers 
and nonvolunteers than were found in the other two situations. Those 
differences that did emerge from the more differentiating tasks did not 
seem to be particularly related conceptually to the differential nature 
of the tasks for which volunteering had been requested. These findings 
should warn us, however, that any of the characteristics of volunteers 
we have discussed may be a function of the particular situation for 
which volunteering had been requested. 

Since it would be desirable to be able to speak about characteristics 
of volunteers for a ‘generalized’ psychological experiment, a special 
effort was made to find studies wherein the request for volunteers 
was quite nonspecific. Several of the studies discussed met this specification 
(e.g., Himelstein, 1956; Leipold and James, 1962; Schubert, 
1960). In these studies, requests were simply for participation in an 
unspecified psychological experiment. Comparison of the characteristics 
of volunteers for this more general situation with differentiating 
characteristics obtained for other task requests again revealed no 
systematic differences. 


Attribute Samples. We have discussed virtually all the attributes of 
volunteers for psychological experiments which differentiate them 
from nonvolunteers of which we are aware. For organizational and 
heuristic purposes, however, we have grouped these together under a 
smaller number of headings. Decisions to group any variables under 
a given heading were made on the basis of empirically established 
and/or conceptually meaningful relationships. 

It should be further noted that, within any heading, such as 
anxiety, several different operational definitions may have been employed. 
Thus we have discussed anxiety as defined by the Taylor 
Manifest Anxiety Scale as well as by the Pt scale of the MMPI. 
Intelligence has been defined by several tests of intellectual ability. 
This practice has been necessary to our discussion in view of the very small 
number of studies employing identical operational definitions of any 
variables excepting age, birth order, and sex. This necessity, however, 
is not unmixed with virtue. If, in spite of differences of operational 
definition, the variables serve to predict the act of volunteering, we can 
feel greater confidence in the construct underlying the varying definitions 
and in its relevance to the predictive and conceptual task at 
hand. 


Summary of 
volunteer characteristics 


I. Statements Warranting Some Confidence 


manifest greater intellectual ability, intellectual 
interest, and intellectual motivation. 

manifest greater need for social approval. 
11. Volunteers tend more often to be moderately well or poorly 
adjusted when male, and better or worse than moderately 
adjusted when female. 

12. Volunteers tend to be less conforming when male but more 
conforming when female. 

ee variables listed need further investigation, particu- 
ealta te = Se group. The situation would seem especially 
e of factor-analytic and multiple-regression techniques. 
Deena a y troublesome question arises from examination of 
tee e consa a of the variables listed in Group I. There 
ater G unteers tend to be more unconventional than non- 
of Crowne tide manifest a greater need for approval. On the basis 
Kuito 1) discussion of the need for approval variable, this 
nexpected and nomologically nettling. 


Implications of volunteer 
characteristics 


For Representativeness. One conclusion seems eminently tenable from 
our analysis. In any given psychological experiment the chances are 
very good indeed that a sample of volunteer S’s will differ appreciably 
from the unsampled nonvolunteers. Let us examine some of the 
implications of this conclusion. One that is rather well known is the 
limitation placed on subsequent statistical procedures and inference 
by the violation of the requirement of random sampling. This problem 
is discussed in basic texts in sampling theory and is mentioned by 
some of the workers we have had occasion to cite earlier (Cochran, 
Mosteller, and Tukey, 1953). 

Granting that volunteers are never a random sample of the population 
from which they were recruited, and further granting that a given 
sample of volunteers differs on a number of important dimensions 
from a sample of nonvolunteers, we still do not know whether volunteer 
status actually makes a difference or not. It is entirely possible that in 
a given experiment the performance of the volunteer S’s would not 
differ from the performance of the unsampled nonvolunteers if 
these had actually been recruited for the experiment (Lasagna and von 
Felsinger, 1954). The point is that substantively we have little idea of the 
s of investigations, 


hich volunteers are 
junteers are actually used, in 
he use of volunteers actually 
makes a difference, what kind of difference, and how much of a 
difference. Once we know something about these questions, we can 
enjoy the convenience of volunteer S’s with better scientific conscience. 
In the meantime the best we can do is to hypothesize what the effects 
of volunteer characteristics might be on any given line of inquiry. 

As an example of this kind of hypothesizing we can take the much 
analyzed Kinsey-type study of sexual behavior. We have already seen 
how volunteers for thi 


attitudes about sexual sated 
unconventional ways. This tendency, as has been frequently pointe 
out, may have had g 

studies, leadi 
biased in th 
bias could p 
students amon 
‘volunteers’ in 


tor’s approval from changing their status of 
nonvolunteer to volunteer. 


Far fewer data are available for most other areas of psychological 
inquiry. Greene (1937) showed that precision in discrimination tasks 
was related to the nature of S’s type of personal adjustment and to his 
intelligence. Since volunteers may differ from nonvolunteers in 
adjustment and, even more likely, in intelligence, experiments utilizing 
discrimination tasks might well be 


ormative sample volunteered for the 
nsistent finding that volunteers are 
envision an affirmative response.³ Consider an experiment to test the 
effects of an independent variable on gregariousness. If volunteers are 
indeed more sociable than nonvolunteers, the untreated control S’s 
may show a high enough level of gregariousness to result in the treat- 
ment’s being adjudged ineffective when, with a less restricted range of 
sociability of S’s, the treatment might well have been judged as leading 
to a statistically significant difference between the means. To cite an 
example leading to the opposite type of error, let us consider an experi- 
ment using female S’s in which some dependent variable is observed 
as a function of good and poor psychological adjustment. If female 
volunteers are more variable than nonvolunteers on the dimension of 
adjustment, comparing S’s in the top and bottom 27 percent for adjust- 
ment level on the dependent variable might lead to a greater ‘treatment’ 
effect than would have been obtained with a sample of nonvolunteers. 
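
The second kind of error can be illustrated with a small simulation. The sketch below is not drawn from any study cited here; the slope, the two standard deviations, and the sample size are assumed values chosen only for illustration. It draws adjustment scores for a less variable and a more variable sample, lets the dependent variable depend on adjustment in the same way in both, and compares the top and bottom 27 percent.

```python
import random
import statistics


def extreme_group_gap(adjust_sd, n=4000, slope=0.5, seed=0):
    """Difference in mean outcome between the top and bottom 27 percent
    of subjects ranked on adjustment (all parameters are hypothetical)."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        adjustment = rng.gauss(0.0, adjust_sd)
        # The dependent variable depends on adjustment identically in
        # every sample; only the spread of adjustment scores changes.
        outcome = slope * adjustment + rng.gauss(0.0, 1.0)
        subjects.append((adjustment, outcome))
    subjects.sort(key=lambda pair: pair[0])
    k = int(0.27 * n)
    bottom = statistics.mean([outcome for _, outcome in subjects[:k]])
    top = statistics.mean([outcome for _, outcome in subjects[-k:]])
    return top - bottom


# Assumed spreads: nonvolunteers SD = 1.0, volunteers SD = 1.5.
nonvolunteer_gap = extreme_group_gap(adjust_sd=1.0)
volunteer_gap = extreme_group_gap(adjust_sd=1.5)
print(round(nonvolunteer_gap, 2), round(volunteer_gap, 2))
```

Under these assumptions the more variable sample shows the larger gap between extreme groups, although the underlying relation between adjustment and the dependent variable is identical in the two samples.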


General summary 
and conclusions 


To McNemar’s statement that ours is a science of sophomores, we 
have added the question of whether we might not lack even this degree 
of generality in our science. The volunteer status of many who serve 
as S’s in psychological research is a fact of life to be reckoned with. 
Our purpose here has been to organize and conceptualize our sub- 
stantive knowledge about the act of volunteering and the more stable 
characteristics of those more likely to find their way into the role of S 
in psychological research. 

The act of volunteering was viewed as a nonrandom event, 
determined in part by more general situational variables and in 
part by more specific personal attributes of the person asked to 
participate in psychological research as S. More general situational 
variables postulated as increasing the likelihood of volunteering 
responses included the following : 


1. Having only a relatively less attractive alternative to volunteering.
2. Increasing the intensity of the request to volunteer.
3. Increasing the perception that others in a similar situation would
volunteer.


3. Dittes’s (1961) finding that lessened acceptance by peers affected first-borns’ 
but not later-borns’ behavior is a most relevant example to the extent that 
we can be sure that first-borns find their way into group experiments 


reliably more often than later-borns. 
4. Increasing acquaintanceship with, the perceived prestige of, and
liking for, the experimenter.
5. Having greater intrinsic interest in the subject-matter being
investigated.
6. Increasing the subjective probability of subsequently being favorably
evaluated or not unfavorably evaluated by the experimenter.

Primarily on the basis of studies conducted with college
populations in a variety of experimental situations, it was postulated
that those personal attributes likely to be associated with a greater
degree of volunteering included the following:


1. Greater intellectual ability, interest, and motivation.
2. Greater unconventionality.
3. Lower age.
4. Less authoritarianism.
5. Greater need for social approval.
6. Greater sociability.

Personal attributes investigated but resulting in only equivocal 

relation 




anxiety, adjustment, eonfonnt 
For all the personal attribu 


icularly for those related more equivocally to the 
likelihood of volunteeri irecti 


References 


Abeles, N., Iscoe, I., and Brown, W. ( ). Some factors influencing the
random sampling of college students. Publ. Opin. Quart. 18, 419-493.

Bell, C. R. (1961). Psychological versus sociological variables in studies of
volunteer bias in surveys. J. appl. Psychol. 45, 80-85.

Belson, W. A. (1960). Volunteer bias in test room groups. Publ. Opin. Quart. 24,
115-126.




R. Rosenthal 39 


Blake, R. R., Berkowitz, H., Bellamy, R. Q., and Mouton, Jane S. (1956). Volun- 
teering as an avoidance act. J. abnorm. soc. Psychol. 53, 154-156. 

Brower, D. (1948). The role of incentive in psychological research. J. gen. 
Psychol. 39, 145-147. 

Brunswik, E. (1956). Perception and the representative design of psychological 
experiments. Berkeley, Calif.: University of California Press. 

Capra, P. C. and Dittes, J. E. (1962). Birth order as a selective factor among 
volunteer subjects. J. abnorm. soc. Psychol. 64, 302. 

Cochran, W. G., Mosteller, F., and Tukey, J. W. (1953). Statistical problems of
the Kinsey report. J. Amer. Statist. Assoc. 48, 673-716.

Coffin, T. E. (1941). Some conditions of suggestion and suggestibility. Psychol. 
Monogr. 53, No. 4 (Whole No. 241). 

Crowne, D. P. (1961). The motive for approval: studies in the dynamics of
influencibility and stereotypical self-acceptability. Unpublished manuscript,
Ohio State University.

Dittes, J. E. (1961). Birth order and vulnerability to differences in acceptance. 
Amer. Psychologist 16, 358 (abstract). 

Edgerton, H. A., Britt, S. H., and Norman, R. D. (1947). Objective differences
among various types of respondents to a mailed questionnaire. Amer. Sociol.
Rev. 4, 434-444.

Foster, R. J. (1961). Acquiescent response set as a measure of acquiescence. 
J. abnorm. soc. Psychol. 63, 155-160. 

Frey, A. H. and Becker, W. C. (1958). Some personality correlates of subjects who 
fail to appear for experimental appointments. J. consult. Psychol. 22, 164. 

Frye, R. L. and Adams, H. E. (1959). Effect of the volunteer variable on leaderless 
group discussion experiments. Psychol. Rep. 5, 184. 

Greene, E. B. (1937). Abnormal adjustments to experimental situations. Psychol. 
Bull. 34, 747-748 (abstract). 

Heilizer, F. (1960). An exploration of the relationship between hypnotizability 
and anxiety and/or neuroticism. J. consult. Psychol. 24, 432-436. 

Himelstein, P. (1956). Taylor scale characteristics of volunteers and nonvolunteers
for psychological experiments. J. abnorm. soc. Psychol. 52, 138-139.

Howe, E. S. (1960). Quantitative motivational differences between volunteers and
nonvolunteers for a psychological experiment. J. appl. Psychol. 44, 115-120.

Hyman, H. and Sheatsley, P. B. (1954). The scientific method. In D. P. Geddes
(Ed.), An analysis of the Kinsey reports. New York: New American Library.
Pp. 93-118.

Lasagna, L. and von Felsinger, J. M. (1954). The volunteer subject in research.
Science 120, 359-361.


40 Research Problems in Psychology 


Leipold, W. D. and James, R. L. (1962). Characteristics of shows and no-shows in
a psychological experiment. Psychol. Rep. 11, 171-174.

Locke, H. J. (1954). Are volunteer interviewees representative? Soc. Probl. 1,
143-146.

London, P. (1961). Subject characteristics in hypnosis research: Part I. A survey
of experience, interest, and opinion. Int. J. clin. exper. Hypnosis 9, 151-161.

London, P., Cooper, L. M., and Johnson, H. J. (1961). Subject characteristics in
hypnosis research. II: Attitudes towards hypnosis, volunteer status, and person-
ality measures. III: Some correlates of hypnotic susceptibility. Unpublished
manuscript, University of Illinois.

Lubin, B., Levitt, E. E., and Zuckerman, M. (1962). Some personality differences
between responders and nonresponders to a survey questionnaire. J. consult.
Psychol. 26, 192.

McNemar, Q. (1946). Opinion-attitude methodology. Psychol. Bull. 43, 289-374.

Marlowe, D. and Crowne, D. P. (1961). Social desirability and response to
perceived situational demands. J. consult. Psychol. 25, 109-115.

Martin, R. M. and Marcuse, F. L. (1957). Characteristics of volunteers and
nonvolunteers for hypnosis. J. clin. exper. Hypnosis 5, 176-180.

Martin, R. M. and Marcuse, F. L. (1958). Characteristics of volunteers and non-
volunteers in psychological experimentation. J. consult. Psychol. 22, 475-479.

Maslow, A. H. (1942). Self-esteem (dominance feelings) and sexuality in women.
J. soc. Psychol. 16, 259-293.

Maslow, A. H. and Sakoda, J. M. (1952). Volunteer error in the Kinsey study.
J. abnorm. soc. Psychol. 47, 259-262.


Newman, M. (1957). Personality differences between volunteers and
nonvolunteers for psychological investigation: self-actualization of volunteers and
nonvolunteers for research in personality and perception. Dissert. Abstr. 17, 684
(abstract).

Norman, R. D. (1948). A review of some problems related to the mail
questionnaire technique. Educ. psychol. Measmt. 8, 235-247.

Pollin, W. and Perlin, S. (1958). Psychiatric evaluation of ‘normal control’
volunteers. Amer. J. Psychiat. 115, 129-133.

Richards, T. W. (1960). Personality of subjects who volunteer for research on a
drug (mescaline). J. proj. Tech. 24, 424-428.




Riecken, H. W. (1962). A program for research on experiments in social psychol- 
ogy. In Washburne, N. F. (Ed.), Decisions, values and groups, Vol. II. New York: 
Pergamon Press. Pp. 25-41. 

Riggs, Margaret M. and Kaess, W. (1955). Personality differences between 
volunteers and nonvolunteers. J. Psychol. 40, 229-245. 

Rosen, E. (1951). Differences between volunteers and nonvolunteers for psycho- 
logical studies. J. appl. Psychol. 35, 185-193. 

Rosenbaum, M. E. (1956). The effect of stimulus background factors on the 
volunteering response. J. abnorm. soc. Psychol. 53, 118-121. 

Rosenbaum, M. E. and Blake, R. R. (1955). Volunteering as a function of field 
structure. J. abnorm. soc. Psychol. 50, 193-196. 

Schachter, S. (1959). The psychology of affiliation. Stanford, Calif. : Stanford 
University Press; London: Tavistock Publications, 1961. 

Schachter, S. and Hall, R. (1952). Group-derived restraints and audience per- 
suasion. Hum. Relat. 5, 397-406. 

Scheier, I. H. (1959). To be or not to be a guinea pig: preliminary data on 
anxiety and the volunteer for experiment. Psychol. Rep. 5, 239-240. 

Schubert, D. S. P. (1960). Volunteering as arousal seeking. Amer. Psychol. 15, 
413 (abstract). (Extended report available.) 

Siegman, A. (1956). Responses to a personality questionnaire by volunteers and
nonvolunteers to a Kinsey interview. J. abnorm. soc. Psychol. 52, 280-281.

Staples, F. R. and Walters, R. H. (1961). Anxiety, birth order and susceptibility
to social influence. J. abnorm. soc. Psychol. 62, 716-719.

Strickland, Bonnie R. and Crowne, D. P. (1962). Conformity under conditions of 
simulated group pressure as a function of the need for social approval. J. soc. 
Psychol. 58, 171-182. 

Wallin, P. (1949). Volunteer subjects as a source of sampling bias. Amer. J.
Sociol. 54, 539-544.

Weiss, J. M., Wolf, A., and Wiltsey, R. G. (1963). Birth order, recruitment condi-
tions, and preferences for participation in group versus nongroup experiments.
Amer. Psychol. 18, 356 (abstract).


the experimenter:
a neglected stimulus object¹


F. J. McGuigan²


From Psychological Bulletin, Vol. 60, 1963, pp. 421-428. Copyright 1963 by
The American Psychological Association. Reproduced by permission.

1. Modification of a paper presented at the American Psychological Association
meetings, 1961, in a symposium entitled “The Social Psychology of the
Psychological Experiment.”

2. The author expresses appreciation to Sherman Ross for his valuable sugges-
tions concerning the presentation of this paper.




Table 1
Number of possible data collectors in a sample of
articles from the Journal of Experimental Psychology

No. of     No. of      No. of Possible Data Collectors
Authors    Articles    1     2     3     4
1          16          10    3     1     2
2          17          0     14    2     1
3          4           0     0     4     0
Total      37          10    17    7     3


we have typically regarded the experimenter as necessary, but undesir- 
able, for the conduct of an experiment. Accordingly, in introductory 
textbooks on experimental psychology we provide prescriptions for 
controlling this extraneous variable; but seldom do we consider the 
experimenter variable further, and the extent to which we actually 
control it in our experimentation can be seriously questioned. As 
documentation for this statement, consider some findings based on an 
analysis of 37 usable articles from three recent issues (selected at 
random) of the Journal of Experimental Psychology. These articles 
were classified according to the number of possible data collectors and 
number of authors. In Table 1 we can see that 10 of the 37 articles had 
only one possible data collector. It is reasonable to assume that at least 
a majority of the other 27 experiments employed more than one data 
collector. In no article was any mention made of techniques of con- 
trolling the experimenter variable and in only one of the articles was 
the number of data collectors actually specified. Furthermore, in no 
article was a statistical analysis of results as a function of experimenters 
reported. It seems quite clear that we are deficient in the write-up and
analysis, if not in the design of our experiments as far as the experimenter
variable is concerned. The possibility is alarming that in multidata
collector experiments adequate control is not exercised. Especially
is this so for those psychologists who have witnessed in amazement the
conduct of experiments by some of their colleagues in which one
experimenter collects data for a while, after which he is relieved by
another experimenter, with no plan for balancing the subjects in the
groups over the experimenters. Such an experiment is totally indefen-
sible. But it is, optimistically, assumed to be relatively rare. Where
pains have been taken to control the experimenter variable in multi-
experimenter experiments, it is not unreasonable to request that results be
presented as a function of experimenters. This request has three bases:
(a) it will justify the control procedures used, (b) it will help indicate the
extent to which the results are generalizable to a population of
experimenters, and (c) it will provide much needed information on the
extent and nature of the experimenter’s influence on the subjects.


Point a needs no further elaboration. But Points b and c can profitably 
be developed. 


Sampling from a population 
of experimenters 


Assume that in a given experiment it was possible to control the experi-
menter variable in a completely adequate fashion by holding that
variable constant. This means that the numerous stimuli emanating
from the experimenter-stimulus object have assumed the same constant,
but unspecified, value for all the subjects throughout the experiment.
Whatever the intensity and other values of these experiment-produced
stimuli, we are assuming that they have not differentially affected the
behavior of the subjects.


Clearly such a techni 
not practical. But that 
variable by holding it co 


expect that the stimuli emitted by 
either be different in nature, or i 
stimuli differentially affect the d 
subjects in the two experiments? 
The problems arising in the sampling of subjects for experimenta-
tion have received considerable attention—the undergraduate psy-
chology major who is not aware of the mechanics of obtaining a
random sample from a well-defined population of subjects is probably
a rare specimen. The same cannot be said, however, with regard
to other populations relevant to experimentation, although the way
was paved some years ago, particularly by Brunswik (e.g., 1947).
Brunswik emphasized the importance of sampling stimulus populations, but rarely
are such populations actually systematically sampled in psychology 
today—especially is this true of the subclass of stimulus variables 
emitted by the experimenter who faces the subjects. On any given 
problem, we could define a population of experimenters, although 
admittedly not easily in an unambiguous fashion. In our conduct of 
an experiment on that problem, then, strictly speaking we should 
employ a design (such as a complete factorial design) that allows us to 
vary experimenters—we should randomly sample from a population of 
experimenters and replicate the experiment for each experimenter used. 
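
The design just described can be sketched in a few lines. This is only an illustrative sketch: the pool of six experimenters, the two conditions, and the 24 subjects are hypothetical, and the function name is ours. It randomly samples experimenters from a defined pool, crosses them with the experimental conditions, and spreads the subjects evenly over the resulting cells so that each experimenter replicates the whole experiment.

```python
import random
from collections import Counter


def balanced_assignment(subjects, experimenter_pool, conditions,
                        n_experimenters, seed=0):
    """Assign subjects evenly to every experimenter-by-condition cell."""
    rng = random.Random(seed)
    # Randomly sample experimenters from the defined population.
    experimenters = rng.sample(experimenter_pool, n_experimenters)
    # Complete factorial: every sampled experimenter runs every condition.
    cells = [(e, c) for e in experimenters for c in conditions]
    if len(subjects) % len(cells) != 0:
        raise ValueError("number of subjects must be a multiple of the cells")
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    # Dealing shuffled subjects round-robin keeps the cell sizes equal.
    return {s: cells[i % len(cells)] for i, s in enumerate(shuffled)}


# Hypothetical pool, conditions, and subjects.
plan = balanced_assignment(
    subjects=[f"S{i:02d}" for i in range(1, 25)],
    experimenter_pool=["E1", "E2", "E3", "E4", "E5", "E6"],
    conditions=["experimental", "control"],
    n_experimenters=3,
)
cell_sizes = Counter(plan.values())
print(cell_sizes)
```

With three sampled experimenters and two conditions there are six cells, so the 24 hypothetical subjects fall four to a cell, which is the balance whose absence was criticized earlier.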

Now let us return to our question: does the fact that two experi-
menters differ only in regard to a single characteristic affect the
performance of subjects in two otherwise identical experiments?
There are three general answers possible.


Case 1. First, the stimulus characteristic in question is totally un-
related to the dependent variable being measured. In this event
essentially the same scores would be obtained by both experimenters.
Clearly in this case, we need not be concerned in the slightest as to
whether or not experimenters in our hypothetical population differ—
their respective characteristics have no differential effects on the
dependent variable. There is but one remaining point: we could not
possibly know this unless we had designed and analyzed our experiment
to find it out.


Case 2. The second general possibility is that the variable for which the
two experimenters differ does affect the dependent variable, but it
affects all subjects in the same way, regardless of the experimental
condition to which those subjects were assigned. For example, we
might suppose that subjects assigned to the anxious experimenter
perform at a higher level, on the average, than do those assigned to the
nonanxious experimenter.

Typically, we are interested in whether or not one group of subjects
performs higher or lower than a second group on a given dependent
variable measure. Since in this second case we are able to reach the
same conclusion with regard to our hypothesis regardless of which
experimenter conducted the experiment, we are not greatly
interested in the experimenter difference. As an adjunct to this
experiment, however, we note that a particular kind of behavior is
influenced by this experimenter characteristic, information that is
potentially valuable.

Case 3. The first two possible answers to our question do not greatly
concern us. The third, however, can be rather important. To take an
extreme case, let us say that the performance of an experimental
group is superior to that of a control group for the anxious experimenter,
but that the reverse is true for the nonanxious experimenter. In short,
suppose that there is an interaction between the characteristics with
which the experimenters differ and the independent variable of the
experiment.
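
The three cases can be told apart from the four cell means of a two-experimenter, two-group experiment. The sketch below uses hypothetical cell means and an exact comparison purely for illustration; with real data the experimenter effect and the interaction would be evaluated by an analysis of variance, not by inspecting whether a difference is exactly zero.

```python
def describe_case(means, tolerance=1e-9):
    """Label a 2 x 2 table of cell means as Case 1, 2, or 3.

    `means` maps experimenter -> group -> mean score (hypothetical data).
    """
    a_exp = means["anxious"]["experimental"]
    a_ctl = means["anxious"]["control"]
    n_exp = means["nonanxious"]["experimental"]
    n_ctl = means["nonanxious"]["control"]
    # Main effect of experimenter: difference between experimenter averages.
    experimenter_effect = (a_exp + a_ctl) / 2 - (n_exp + n_ctl) / 2
    # Interaction: does the treatment difference change with experimenter?
    interaction = (a_exp - a_ctl) - (n_exp - n_ctl)
    if abs(interaction) > tolerance:
        return "Case 3"
    if abs(experimenter_effect) > tolerance:
        return "Case 2"
    return "Case 1"


case_1 = {"anxious": {"experimental": 10, "control": 8},
          "nonanxious": {"experimental": 10, "control": 8}}
case_2 = {"anxious": {"experimental": 12, "control": 10},
          "nonanxious": {"experimental": 10, "control": 8}}
case_3 = {"anxious": {"experimental": 12, "control": 8},
          "nonanxious": {"experimental": 8, "control": 12}}

for table in (case_1, case_2, case_3):
    print(describe_case(table))
```

In the second table the anxious experimenter raises every score by the same amount, so the conclusion about the treatment is unchanged; in the third the treatment difference reverses sign between experimenters, which is the extreme case described above.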
As an example of a Case 3 experiment, briefly consider an inter- 
action reported by Kanfer (1958). Two experimenters who had 
“minimal gross differences” participated in a verbal conditioning 
experiment. The subjects were required to say words continually and 
the verbs that they emitted were reinforced by flashing a light according 
to one of three reinforcement schedules. The experimenter’s task was 
simple—to discriminate between verbs and nonverbs, and flash a light. 
The results indicated a significant Method × Experimenter interaction
—there was more frequent reinforcement of words for one schedule
than for the others, the frequency varying for the experimenters. The
experimenters evidently differed from each other in their ability to
perceive verbs as a function of reinforcement schedule. The reason for
this seems obscure, but the lesson to the investigator is again driven
home—if our results are a function of experimenter characteristics,
then they are highly specific and cannot be generalized.
It should be emphasized that interactions involving experimenters
may not only be unexpected, but quite obscure. In general we simply
have not had enough experience with experimenter interactions to
know where to look for them. To further emphasize the obscurity of
this type of interaction, consider some results from a study involving
four methods of learning and nine experimenters (McGuigan, 1960).
The analysis of variance indicated that there was a significant difference
among methods but that experimenters did not differ, and particularly
that the methods by experimenter interaction was not significant.
According to our normal procedure, we would conclude that the
results with regard to methods are not a function of experimenters.
But now let us study the interexperimenter variability more closely.
We can note that there is considerable variability among experimenters
for Methods P and VIW in Figure 1, but that there is relatively little
interexperimenter variability for Methods IW and W. The variance
for each method was computed and it was found that they differ
significantly. Furthermore, the variability among the experimenters is a
function of methods when methods are ordered from P to VIW to IW
and to W.

Figure 1. Dependent variable scores for four methods plotted as a function of
experimenters (after McGuigan, 1960).


In Figure 1 we arranged the experimenters on the horizontal
axis in a random fashion. Lines of best fit are approximately parallel.
In Figure 2, however, we have arranged the experimenters according
to intraexperimenter variability. Now lines of best fit appear to
deviate rather markedly from being parallel. Particularly note that the
relative proficiency due to the various methods is a function of the
experimenters. Here we have a single experiment replicated nine
times. Suppose that we had conducted the experiment only once using,
say, Experimenter Number 9. This experimenter yielded a clear set of
results due to methods. But had we chosen Experimenter Number 8,

Figure 2. Figure 1 ordered according to intraexperimenter variability.

of conversations by student experimenters indicated considerable
variation in reinforcement techniques as well as downright distortion.
Table 2
r’s between personality characteristics of nine
experimenters and dependent variable scores
(time to perform a task) of their subjects

Bernreuter     Bell       Manifest Anxiety
B1-N: .35      A: .15     .04
B2-S: .19      B: .15
B3-I: .24      C: .09
B4-D: .14      D: .08
F1-C: .16
F2-S: .23

Stories of violation of proper data collection procedures by graduate
assistants are legion, if somewhat suppressed. Analyses to determine
differences among experimenters on dependent variable scores can
serve to at least stimulate investigation of procedural problems in a
given experiment. The second possible difference among experimenters
in Figure 2 concerns what we might call personality characteristics
of the experimenters—we have a possible ordering of experimenters
along some personality dimension. The only question is what is it
and how might we discover it? One thing we could do when we find
this sort of interaction is to administer a battery of personality tests to
our experimenters, in an effort to determine personality differences
that differentially influenced a given dependent variable. Hints can
thus be obtained that can lead to additional experiments in which
characteristics of the experimenters are varied in an effort to better
understand the nature of these interactions that concern us. We
actually did this for the experimenters of Figure 2. Table 2 shows a
sample of the correlations between trait scores of the experimenters
and dependent variable scores of their subjects. None of the correlations
was significant, but several were high enough to be somewhat suggestive
even without significant differences among experimenters, and with
such a limited sample. As an illustration: the more neurotic (B1-N
scale of the Bernreuter) the experimenter the poorer the performance of
the subject.
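
Why correlations of this size fall short of significance with only nine experimenters can be checked with the usual t test for a correlation coefficient. The sketch below is a minimal illustration and not the author's analysis: it assumes the largest entry of Table 2 is r = .35, and it uses the tabled two-tailed .05 critical value of t for 7 degrees of freedom (2.365). The helper names are ours.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


T_CRIT_DF7 = 2.365  # two-tailed .05 critical value of t for 7 df

# Assumed reading of Table 2: largest correlation .35, nine experimenters.
t_largest = t_for_r(0.35, 9)
print(round(t_largest, 3), t_largest < T_CRIT_DF7)
```

On that reading the largest correlation gives a t of just under 1, well below the critical value, which agrees with the statement that none of the correlations was significant. `pearson_r` is included only to show how each entry of such a table would be computed from the experimenters' trait scores and their subjects' mean scores.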
Experiments in which experimenters with different personality
characteristics were deliberately used are few in number. One such
study was a verbal conditioning experiment using the response class
of hostile words emitted in sentences (Binder, McConnell, and
Sjoholm, 1957). Whenever the subject used a hostile word in a sentence
the experimenter reinforced that response by saying “good.” Two
groups were used, a different experimenter for each group. The two
experimenters differed in gender, height, weight, age, appearance, and
personality:


The first...was...an attractive, soft-spoken, reserved young lady ...
5'½” in height, and 90 pounds in weight. The...second...was very
masculine, 6'5” tall, 220 pounds in weight, and had many of the unre-
strained personality characteristics which might be expected of a former
marine captain—perhaps more important than their actual age difference
of about 12 years was the difference in their age appearance: the young
lady could have passed for a high school sophomore while the male
experimenter was often mistaken for a faculty member (Binder et al.,
1957, p. 309).


Figure 3. Learning curves for two groups treated the same except for experimenters.
(The steeper slope for the subjects of the female experimenter illustrates
an interaction involving experimenters—after Binder et al., 1957.)


The results of this experiment are shown in Figure 3. We can see
that the rate of emitting hostile words increases with trials for both
groups—saying “good” reinforced the response for both experimenters.
But of particular significance to us now is the fact that the rates of
learning for the two groups differed significantly—the slope is steeper
for the female experimenter’s group. Clearly the differences between
the two experimenters are numerous, so it is difficult to specify just
what experimenter characteristic or combination of characteristics
is responsible for this difference in learning rate of the two groups. But
this research is a promising start. A follow-up of it might be aimed at
testing the authors’ speculation as to the important difference: that
the female experimenter “provided a less threatening environment, and
the S’s consequently were less inhibited in the tendency to increase
their frequency of usage of hostile words” (Binder et al., 1957, p. 313).


Figure 4. Learning curves for four conditions. (The marked departure of the Hy-
Positive-set group—solid line—illustrates an interaction between set
for the experimenter and personality characteristics of the subjects—
after Spires, 1960.)


An interesting experiment by Spires (1960) is illustrative of how
characteristics of the subjects can interact with perceived characteristics
of the experimenter. Spires selected a group of subjects high on the Hy
scale of the MMPI and a second group high on the Pt scale. The subjects
entered the experimental situation with one of two sets: the positive
set was where the subject was told that the experimenter was a “warm,
friendly person, and you should get along very well”; the negative set
was where the subject was told that the experimenter may “irritate
him a bit, that he’s not very friendly, in fact kind of cold.” This was a
verbal conditioning study in which a class of pronouns was reinforced by
saying “good.” An analysis of variance for Spires’s results indicated that
there was a significant difference between positive and negative sets
for the experimenters (the positive set leading to better conditioning),
and that the interaction between set for the experimenter and MMPI
score of the subject was significant. This interaction is illustrated by
the learning curves shown in Figure 4. There we can see that the
hysterics, who had a positive set for their experimenter, condition
remarkably better than the other three groups. While apparently this
is the only study which shows that a rather well defined personality
characteristic of the subjects interacts with a perceived characteristic
of the experimenter, further investigation would undoubtedly yield
additional interactions of this nature.


Conclusions 


2. Where more than one data col 
of control should be specified, (b) 


are generalizable to a population o 3 

which such a population has been sampled. Granted that completely 
satisfactory sampling can seldom occur, at least some sampling is 
better than none. And it is beneficial to know and to be able to state 
that, within those limitations, the results appear to be instances of 
Cases I or II. If the experiment turns out to be an instance of Case III,
the extent to which the results can be generalized is sharply limited.




One can only say, for instance, that Method A will be superior to 
Method B when experimenters similar to Experimenter Number 1 are 
used, but that the reverse is the case when experimenters similar to 
Experimenter Number 2 are used. This knowledge is of course valuable, 
but only in a negative sense since we do not know what the character-
istics of the two experimenters are—to understate the matter, the
interaction tells us to proceed with considerable caution.


3. It is important to contribute to our general fund of knowledge of 
the experimenter variable, for it is indeed small at this time. That this 
request to collect relevant data will not excessively burden us is 
indicated by the frequency with which more than one experimenter 
already participates in an experiment (see Table 1; further, note that 
in a sample of 722 articles from journals concerned primarily with 
experiments, 48% had two or more authors [Woods, 1961]). Quite
clearly we already have enough information to safely assert that 
interactions between experimenters and treatments do occur. But 
there is a paucity of data about their frequency of occurrence as a 
function of type of experimental situation. By designing more experi- 
ments to test for differences between experimenters and for interactions 
involving experimenters we may eventually be able to handle the 
problems indicated in Number 1 above and by instances of Case III.


As with all other variables with which we are concerned, deter-
mining the effects of the experimenter variable is a long, energy
consuming project. But we must face up to our task. Recognizing the
enormity of this project, one can well ask whether or not there is a 
more efficient approach. The only other possibility that occurs at 
present is to eliminate the experimenter from the experiment. For 
some problems that we study, this would be relatively easy, but it is
hard to visualize how this could be accomplished in other experiments. 
For instance, a number of completely automated devices have been 
developed and successfully used in running rats—the subjects are 
never exposed to a human experimenter. Automation has also 
entered psychology at the human level, but in neither case is automation 
very general, and certainly it is not standardized. In a number of 
experiments it seems reasonable to have the subject enter the experi- 
mental room and be directed completely by taped instructions, thus 
removing all visual cues, olfactory stimuli, etc., emitted by the experi-
menter. If eventually the human experimenter is replaced by devices
which automatically run the subject through his routine, we must be
careful not to select values of stimuli emanating from these devices that
themselves interact with the treatments that we are studying.
McGuigan (1963) states:


While we have traditionally recognized that the characteristics of an
experimenter may indeed influence behavior, it is important to observe
that we have not seriously attempted to study him as an independent
variable [p. 421].


However, Stumpf with his careful, detailed measurements of
questioners’ cues began the study of the experimenter as an independent
variable in 1904, but not until recently has this problem been considered
by experimental psychologists for study (Cordaro and Ison, 1963;
McGuigan, 1963; Rosenthal and Halas, 1962). Clinical psychologists
have long led the way in this aspect of investigation. The personal
effect of examiners upon patients’ performance in clinical tests was
initiated as an object of study 35 years ago (Marine, 1929). Yet
psychologists working in the laboratory have not been completely
unaware of the implications of experimenter influence upon subjects.


Ebbinghaus (1913) in discussing the effects of early data returns
upon psychological research stated:


It is unavoidable that, after the observation of the numerical results,
suppositions should arise as to general principles which are concealed in
them and which occasionally give hints as to their presence. As the
investigations are carried further, these suppositions, as well as those
present at the beginning, constitute a complicating factor which probably
has a definite influence upon the subsequent results [pp. 28-29].

Pavlov, noting the apparent in
generations of mice i
characteristics, suggested that

(Gruenberg, 1929, p. 327).
The foregoing yields some in[...] phenomenon. However, res[...]
affected result. A study in which experimenters (E’s) recorded the
frequency of contractions and head turns of [...] In this case subjects
[...] statistically significant [...] were obtained for E’s [...]

Thus far we have seen that E may not only bias S’s responses but
also that his interpretation of S’s responses may be biased. Because
these effects are dependent upon E’s knowledge of the hypothesis to
be tested or his expectancy, one can readily propose that the solution
to this problem would be, as is often the case, a simple matter of having
research assistants (A’s) who are unaware of the hypothesis collect the
data for E. In testing this suggestion, it was found that “a subtle transfer
of cognitive events” existed, resulting in response bias (Rosenthal,
Persinger, Vikan-Kline, and Mulry, 1963, p. 313). The authors state:


Our finding of a subtle transmission of E’s bias to their A’s forces us to 
retract an earlier suggestion for the reduction of E bias (Rosenthal, 1963c). 
Our recommendation had been to have E employ a surrogate data 
collector who was to be kept ignorant of the hypothesis under test. The 
implication of the suggestion was simply not to have E tell A the hypothesis.
It now appears that E's simply not telling A the hypothesis may not insure 
A’s ignorance of that hypothesis [pp. 332-333]. 


It is the present authors’ contention that wherever an experimenter- 
subject relationship exists, the possibility also exists for E to contamin- 
ate his data by one or more of a multitude of conveyances. It appears 
that experimental psychology has too long neglected the experimenter 
as an independent variable. By relating some of the findings of clinical 
and social psychologists, as well as the few experimental studies to date, 
it is hoped that experimental psychologists will no longer accept on 
faith that the experimenter is necessary but harmless. Implications 
for experimental, counseling, and testing psychology will also be 
considered. 


Research findings 
Nondifferentiated effects 


Research studies have been, on the whole, minimal in reporting 
differential results with regard to individual E’s. In particular, this is 
true concerning careful discussion of the possible reasons for the 
differing data. It is, however, illustrative of the pervasiveness of 
experimenter effect to examine several of the studies which have shown a 
nondifferentiated experimenter influence.
Lord (1950) was interested in examining Rorschach responses in
three different types of situations. Thirty-six S’s took the Rorschach 
three times—once from each of three different female examiners. 
Of the Rorschach responses being considered for differences within 
S’s, Lord found 48 to yield t tests significant at the .10 level. Of these,
27 were due to examiner differences. 

In an interesting study on learning without awareness (Postman 
and Jarrett, 1952), 30 different E’s—all students in advanced experi- 
mental psychology classes—were employed. The S’s responded to 
240 stimulus words with another word which came to mind. Half the 
S’s were instructed to guess, and half were told, the “correct” principle 
of answering, which was to give common associations as found in 
speaking and writing. Differences among E’s were highly significant
sources of data variance. Postman and Jarrett suggest that since
complete universal uniformity of experimenter behavior is apparently 
impossible, the difficulty experienced in attempting to replicate results 
of other investigators is to be expected. 
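
One conventional way to check whether experimenters are a source of data variance is a one-way analysis of variance with experimenter as the factor. The Python sketch below illustrates that computation only; the function name `one_way_f`, the data layout, and the scores are hypothetical and are not taken from Postman and Jarrett, whose actual analysis is not described here.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists; each inner list holds the scores of the
    subjects run by one experimenter (hypothetical data layout).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variation of experimenter means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of subjects around their own experimenter's mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Made-up scores for three hypothetical experimenters, four subjects each.
scores_by_experimenter = [
    [12, 14, 11, 13],
    [18, 17, 19, 20],
    [12, 13, 12, 11],
]
f_ratio, df_b, df_w = one_way_f(scores_by_experimenter)
```

The resulting F ratio would then be compared with a tabled critical value for the two degrees of freedom; a large ratio suggests that scores differ by who ran the subject.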


[...] effect was found, significant at the .01 level [...]
largely due to only one of the four E’s, since repetition of the analysis
without this E’s data [...]

An avoidance study using rats (Harris, Piccolino, Roback, and
[...]) [...] primarily concerned with the effects of alcohol
[...] response in a Miller-Mowrer shuttle box. [...]
E’s.


Personality 


In discussing the differential effect of E upon S, Masling (1960) was
writing with particular attention to projective testing. However, it
seems logical to postulate that if sex and aspects of the examiner's
personality (such as warmth or coldness) are causative of differential


B. L. Kintz et al. 59 


results in projective testing, the influence of these personal variables 
may also be felt in other, even objective, situations. 

In attempting to assess effects of personality factors of experi- 

menters in the experimental situation, McGuigan (1960) compared 
trait scores of E’s on personality tests with dependent variable scores 
of S’s. He did not obtain any significant correlations, but noted several 
quite high ones that may indicate directional influences. For example, 
the more neurotic (BI-N scale of the Bernreuter) the E, the poorer the 
performance of S. 
The effect of E’s personality upon Ss’ performances had been
investigated earlier (Sanders and Cleveland, 1953) using projective 
techniques. Nine E’s took the Rorschach, which was scored blindly 
by two experienced clinical psychologists. The E’s were then trained 
in administering the Rorschach, and each E gave it to 30 S’s. An 
attempt was made to deliberately standardize the questioning proce- 
dures used. After taking the Rorschach, each S filled out a question- 
naire designed to elicit his attitudes about the particular E. Sanders 
and Cleveland found that overtly anxious E’s (as indicated by their 
own Rorschach responses) tended to elicit more subject flexibility and 
responsiveness, while overtly hostile E’s (again measured by their 
Rorschach responses) drew more passive and stereotyped responses 
and less of the hostile responses. The Ss’ questionnaires indicated that 
E’s who were most liked were those who had been rated low on anxiety 
and hostility.

The research just mentioned has been primarily interested in the 
effect of the personality of E, per se, on Ss’ performances. One further 
study is especially interesting, as it tries to answer the pertinent
question of whether E’s personality and personal bias can interact. 
Rosenthal, Persinger, and Fode (1962) used 10 naive E’s, who were 
biased to expect certain results. They found that agreement of final 
data and E bias were related to Es’ scores on the MMPI scales, L, K, 
and Pt, but not to age or grade-point average. 

The S-E personality interaction is dependent, of course, not only on 
the personality of E, but also to some degree on that of S. In one of the
few studies designed to investigate this interaction, Spires (1960;
cited in McGuigan, 1963) used a 2 x 2 design in a verbal conditioning
paradigm, reinforcing a particular class of pronouns with the word
“good.” The S's were divided into two groups, one of which had
scored high on the Hy scale of the MMPI and one of which had scored
high on the Pt scale. Each group was subdivided in half, receiving
either a positive or a negative “set” (“this experimenter is warm and
friendly” or “this experimenter is cold and unfriendly”). The high 
Hy-positive set group far surpassed the other three groups. The high 
Hy-negative set group performed the poorest. Thus, not only E’s
personality, but Ss’ perception of this personality, can contribute to the 
E effect. 

Investigation of Ss’ perception of E has been undertaken by two 
related studies (Rosenthal, Fode, Friedman, and Vikan-Kline, 1960; 
Rosenthal and Persinger, 1962). In the first experiment, S’s were asked 
to rate E on a number of variables. In the second study, the experiment
was not actually conducted, but only described, and S’s were requested 
to imaginatively rate their imaginary E. Yet a correlation which was 
calculated between the ratings of the first and second studies yielded an 
r of .81. This would appear to support the hypothesis that naive 
S's, in particular, may have a kind of predetermined “set” about 
what a “typical” E is like—scientific, intelligent, etc. 


Experience 


Investigators with widely variant amounts of experience are busily 
conducting studies every day. Cantril (1944) stated that interviewers
who are highly experienced show as much bias as those who are less
experienced. In an experimental investigation, however, Brogden
(1962) came to a different conclusion. Four E’s each trained a group
of rabbits and recorded the acquisition speed of a conditioned shock-
avoidance response. The rabbits of the three experienced E’s reached
the learning criterion faster than the naive E’s rabbits. To further study
this result, the naive E was required [...]
[...] between E’s.


Sex 


[...] investigating the manner in [...] E differences in sex. In a verbal
[...] (Binder, McConnell, and Sjoholm, 1957), S’s were
reinforced for saying hostile words. Two clearly distinguishable E’s
were employed: one, a young, petite feminine girl; the other, a
[...] (1953) findings that [...]
S’s than do other E’s.
Sarason and Minard (1963) also found that sex and hostility
significantly influenced Ss’ performances. Degree of contact between
S and E and E’s prestige value (as perceived by S) also contributed
significant effects. Sarason and Minard warn that ignoring these
situational variables is hazardous research methodology.

In a very recent experiment investigating the sex variable, Stevenson
and Allen (1964) show what is perhaps the most clear-cut demonstration 
of S-E interaction. Eight male and eight female E’s each tested eight 
male and eight female S’s in a simple sorting task. The mean number 
of responses was recorded at 30-second intervals. With either male 
or female E’s, female S’s made more responses than did male S’s. 
However, all S’s performed relatively better under an opposite- 
sexed E.


Expectancy effect 


Perhaps the component of experimenter effect which is the cause of 
greatest concern is that by which the E in some way influences his 
S’s to perform as he has hypothesized. The reasons for concern about 
expectancy effect are that so little is known about it and so little 
research has been devoted to it. Only recently have systematic studies 
been conducted in this area.
Rosenthal and Fode (1963a) demonstrated the problem clearly in
an experiment with two groups of randomly assigned animals. One
group of six E’s was instructed that its group of rats was “maze-bright”
and a second group of six E’s was instructed that its group of rats was
“maze-dull.” In a simple T maze, the maze-bright rats performed
significantly better than the maze-dull rats.

In a similar study (Rosenthal and Lawson, in press) investigators
divided 38 E's into 14 research teams, each of which had one rat
randomly assigned to it. Six of the teams were told that their rats were
bred for dullness and the other eight were told that their rats had been
bred for brightness. Seven experiments, including such tasks as
operant acquisition, stimulus discrimination, and chaining of responses,
were conducted. In seven out of eight comparisons (overall p = .02),
difference in performance again favored E’s who believed their S’s
to be bred for brightness. A factor which may have prompted the
difference was that E’s who believed their rats were bred for brightness
handled them more than E’s who believed their rats were bred for
dullness.

In both experiments cited, the question arises as to the sensitivity
of the animals to attitudinal differences in E’s transmitted through the
tactual and sensory modalities. Further research is required to clarify
the issue.


Modeling effect 


Modeling effect is defined (Rosenthal, 1963b) as a significant correla- 
tion between E’s performance and the performance of randomly 
assigned S’s on the same task. 

Graham (1960) divided 10 psychotherapists into two groups on the 
basis of their perception of movement in Rorschach inkblots. In the 
ensuing psychotherapeutic sessions, patients of the group of psycho- 
therapists that perceived more movement in the inkblots saw a 
significantly greater amount of movement than the patients of the 
group of psychotherapists that had perceived less movement. 

In the area of survey research the phenomenon of modeling has
been reported in studies by Cantril (1944) and Blankenship (1940)
who have found that interviewers elicit from their interviewees, at a
probability greater than chance, responses which reflect the
interviewers’ own beliefs.


Rosenthal (1963b) reported eight experiments conducted to assess
the existence and magnitude of experimenter modeling effect by
employing the task of Ss’ rating a series of photographs of people on a
scale of apparent successfulness and unsuccessfulness ranging from
-10 to +10. Prior to each experiment, E’s had rated the photos,
which were selected because in earlier ratings on the same scale they
had yielded a mean value of zero. The resulting eight rank-order
correlations between E’s ratings and their Ss’ ratings ranged from
-.49 to +.65. Only the rho of +.65 was significantly different from
0 (p < .001), but the hypothesis of equality among the eight rhos was
rejected using a chi-square test (p < .005).
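
A chi-square test of equality among several correlations is commonly carried out on Fisher z-transformed coefficients. The Python sketch below shows that approach under stated assumptions: rank-order rhos are treated approximately like product-moment r's, and the sample size of each experiment is known. The function names, the eight rhos, and the sample sizes are invented for illustration (only the extremes, -.49 and +.65, come from the text), and this is not necessarily the procedure Rosenthal used.

```python
import math


def fisher_z(r):
    """Fisher's r-to-z transformation (inverse hyperbolic tangent)."""
    return math.atanh(r)


def homogeneity_chi_square(rs, ns):
    """Chi-square statistic for H0: all population correlations are equal.

    rs: sample correlations; ns: matching sample sizes (each must exceed 3).
    Returns (statistic, degrees_of_freedom).
    """
    weights = [n - 3 for n in ns]          # inverse variances of the z's
    zs = [fisher_z(r) for r in rs]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    statistic = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    return statistic, len(rs) - 1


# Hypothetical values: only -0.49 and 0.65 appear in the text.
example_rs = [-0.49, -0.20, -0.05, 0.10, 0.18, 0.25, 0.31, 0.65]
example_ns = [20] * 8                       # assumed sample sizes
statistic, df = homogeneity_chi_square(example_rs, example_ns)
```

The statistic is referred to a chi-square table with `df` degrees of freedom; a value beyond the tabled criterion leads to rejecting the hypothesis that the correlations are equal.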

Hammer and Piotrowski (1953) had three clinical psychologists
and three interns rate 400 House-Tree-Person drawings on a 3-point
scale of aggression. The degree of hostility which clinicians saw in the
drawings correlated .94 with the evaluations of their personal hostility
made by one of the investigators.


Early data returns effect 


Early data returns effect is the problem of the experimenter who is 
receiving feedback from his experiment through early data returns and 
who contaminates the subsequent data. The reasons why this occurs 
are unclear, but some suggestions are that E’s mood may change if the
data are contrary to his expectations, or, if the data are in agreement
with his expectations, there is the possibility of heightening an existing
bias. There is evidence (Rosenthal, Persinger, Vikan-Kline, and Fode,
1963) that this mood change in E, brought about by “good results,”
may lead him to be perceived by the S’s as more “likable,” “personal,”
and more “interested” in their work and thereby influence their
performance.
In the study by Rosenthal, Persinger, Vikan-Kline, and Fode
(1963), three groups of four E’s each had three groups of S’s rate the
[...]graphs on a scale ranging from [...]
[...] that Ss’ mean rating would be [...] S’s were con[...]
[...] obtained bad data [...] relation to the control.
[...] group, the experimental groups [...] each other. There was a
further tendency [...] returns to become more pronounced [...]
Griffith (1961) states clearly the effect of early data returns in an
autobiographical documentary:


Each record declared itself for or against ... (me) ... (and) ... (my) ...
spirit rose and fell almost as wildly as does the gambler whose luck
supposedly expresses to him a higher love or reflection [p. 309].


Overview of cues and
their transmission


After discussing at some length the various experimenter effects, the
question must certainly arise as to how the experimenter contaminates
his data. What are these cues and how are they transmitted? Some
suggestions have been made but it is necessary to look at evidence
dealing directly with the problem. It was suggested earlier that in the
case of laboratory animals it might be due to tactual and kinesthetic
cues, but probably also involved are all of the sensory processes of the
organism so that E inadvertently transmits cues by nearly everything
that he does.

In dealing with humans, because of the probable lack of [...]
contact, cues are transmitted verbally and/or visually. But “verbally”
implies not only the words, but also the inflectional and dynamic
processes of speaking.

The transmission of verbal cues was first dramatically demonstrated
by Greenspoon (1955) who, by reinforcing plural nouns with
“mmm-hmmm,” was able to increase the frequency of emission of such words.
In a similar experiment, Verplanck (1955) was able to control the
content of Ss’ conversation by agreeing with some opinions and
disagreeing with others. The results showed that every S increased
in his rate of speaking opinions with reinforcement by agreement,
and 21 out of the 24 Ss decreased their rate of opinion statements
with nonreinforcement.


Rosenthal and Fode (1963b) conducted two experiments specifically
designed to investigate the transmission of cues from E to his human
S’s. The S’s were to rate the [...] photographs on a scale [...]
identical instructions [...] scale as the S’s, [...] showed that S’s
for high-biased (+5) E’s obtained significantly higher [...]
(-5) E’s. Since E’s were not [...]

Wickes (1956) also showed the import of visual cues by effectively
using nodding, smiling, and leaning [...] reinforcement
for certain responses given to inkblots by clients in psychotherapy
sessions.


Considerable research is required to learn what the cues are, how
they are transmitted, and how they can be controlled.
Implications of the
experimenter effect 


[...] of the literature has revealed the existence of the
experimenter effect in all aspects of psychology. Although the
experimenter effect is generally recognized and perhaps paid lip service, it
[...] a forgotten skeleton in the research psychologist’s closet.
[...] of a study by Postman and Jarrett (1952) with one by
[...] (1964) provides an example.
Postman and Jarrett (1952) commented:

[...] paid too little attention to the contributions made by variations
[...] behavior to the experimental results. The difficulty which many
[...] experience in repeating the results of other investigators
[...] to our failure to attack systematically the role of differences
among E's [p. 253].


[...] (1964), after examining various aspects of variability occurring
[...] using the Taylor Manifest Anxiety scale, says the
[...] in concluding his discussion of the experimenter-subject
interaction:


This is, nevertheless, a potentially important variable and should be
investigated further, possibly by deliberately manipulating the behavior
of E [p. 136].


It is clear from a comparison of these two statements that during the
[...] years the progress in examining and controlling the
experimenter-subject effect has been something less than spectacular. Thus, the
purpose of this portion of the present paper is to attempt to alter
[...] research procedure by emphasizing the implications of the
experimenter effect as it relates to the individual psychologist engaged
in his varied activities.


Clinical implications


Clinicians have long recognized the influence of the experimenter
(therapist) upon the behavior of a subject (client). In fact, the differing
views existing in the clinical realm as to the most effective therapeutic
procedure to utilize seem to have their origin in the clinician’s
conception of the role of the therapist in the therapeutic situation. For
instance, the psychoanalyst believes transference is essential if the
patient is to be led to adjustment, whereas the nondirective therapist
strives to accompany the patient along the road to adjustment rather
than to lead.

Even though clinicians not only recognize but argue over the
implementation of the experimenter influence, they are not exempt
from a thorough evaluation of the implications (some of which are
discussed below) that the experimenter variable holds for the clinical
field.


[...] defined therapeutic procedures.

Other more specific clinical areas affected by E-S interaction
would include the effect upon patients’ Rorschach scores as a function
of experimenter differences (Lord, 1950), [...]
[...]n-client interaction, [...]
[...] might involve the objective assignment of [...]
[...]s of a large-scale correlational deter[...]
[...] “personality types” interact most [...]
Implications for the
field of testing 


The general field of testing which would include IQ tests, placement 
tests, reading readiness tests, aptitude tests, etc., is also beset with the 
problem of the experimenter variable. Even though there have been
rigorous attempts at standardization of test items and procedures in 
this area, E or administrator of the test still influences the test taker in 
other subtle ways (Kanfer, 1958; Rosenthal, 1963a). 

The implications of the experimenter effect in the testing area 
have many ramifications. It is questionable whether many tests have 
been proven sufficiently reliable and valid in their own right, and this 
problem is further complicated by the experimenter variable. Judgment
of an individual's score on special abilities and IQ tests, etc., must
not only be viewed in light of which test was used, but must also take
into consideration the previously ignored variable of the specific
administrator. In addition to knowing that a person achieved an IQ
score of 105 on the Stanford-Binet and not the Wechsler, it is also
necessary to know whether or not E was threatening, docile, friendly, 
anxious, or expected the test taker to be smart, dumb, score well, 
etc. (Binder, McConnell, and Sjoholm, 1957; McGuigan, 1963; 
Rosenthal, 1963a). 

The administrator contamination problem may eventually be
resolved by the application of machines to the administration of tests.
At this time a more judicious selection of the hundreds of available
tests on the part of administrators, using test results to guide their
decision-making process, is essential. In addition, test results should
be viewed with a more sophisticated, critical eye, with IQ and aptitude
scores being considered as but some of many indices of performance.
All persons using test scores must recognize the strong influence of E
and make decisions accordingly.


Experimental
implications


The psychologist engaged in controlled experimentation should
realize that he has failed to provide a control for himself. That this
variable is disregarded is evidenced by Woods’ (1961) investigation of
1,737 published experiments, of which 42%-45% involved multiple
authorship. None [...] ran an analysis of experimenter interaction.


One particular aspect of controlled experimental endeavor which
has neglected the experimenter effect is learning-theory research.
Much energy is expended on “crucial” experiments which ostensibly
attempt to determine which of the conflicting theories of Hull, [...],
Guthrie, and others are correct. At the present time these [...]
experiments have produced results which are generally [...],
except for establishing a high correlation between the theory an E’s
results support and his theoretical position.

The [...] already reviewed provide a speculative base for
partially explaining the conflicting results obtained by the [...]
of various learning theories. As Rosenthal (1963c) has shown,
experimenter bias is a powerful influence in the experimental situation.
E has many opportunities to influence, unintentionally, S’s who have
been brought into a very strange, highly structured situation. In [...]
of this, it is not surprising, and should be expected, that E’s favoring
a particular learning theory would tend to obtain results favoring this
same theory.


Results reported recently (Cordaro and Ison, 1963; Rosenthal and
Fode, 1963a; Rosenthal and Halas, 1962) indicate that E’s also affect
the results of studies using nonhuman S’s. These findings further
emphasize the possibility that the E is a powerful, yet much ignored,
variable. It is a strange paradox that even many of the most adamantly
scientific of psychologists have failed to control for the experimenter
variable.


Conclusions


Future experimentation might prove more profitable if more rigorous
communication could be established between researchers of differing
[...]ange were implemented, it might prove an effective
means of controlling [...]
[...] included counterbalancing of E’s and the use of
[...] which include the experimenter as a major independent
[...] (1960; cited in Rosenthal, 1963), as reported by
[...] found that both visual and auditory cues influenced
[...] S’s. Thus, another suggestion involves the elimination
[...] cues, including inflections of the voice, speaking
[...] gestures, etc., as transmitted to S’s during the reading of
[...] which began with a discussion of a horse and the
[...] cues, has ranged far afield. We have seen that
the experimenter effect exerts an insidious influence upon the
relationship between counselor and client. Indeed, the more objective and
[...] the counselor, the greater the potential hidden effect.
[...] of the relationship between counselor and client
[...] is to lose much of the control that a counselor must
[...] the counseling situation. In the same way teachers
[...] that objective appraisal by their students is affected by
[...] which the students believe their teachers have. And finally,
probably most important at this time, directors of laboratory
research who use student E’s must be aware of the extremely great
[...] of their personal biases which can be perceived by the student E’s
and translated into practically any significant experimental effect.
Robert Rosenthal


[...] research by using tables of rand[...] They are, rather,
chosen because the experimenter has certain expectations about the
relationship or lack of relationship between the selected variables and
certain other variables. A superficial exception to this might be seen in
so-called heuristic hunts for relationships, which are perhaps more
common to the behavioral sciences. Even here, however, the inclusion
of variables is not on a random basis, and certain relationships appear
more likely to be found than others.

Experimenters then often, if not always, have some sort of expectations
about how the data will fall. Also often, if not always, they care
about how these data fall. Some outcomes may be expected more than
others; some outcomes may be desired more than others. Our
purpose here is to discuss the question of whether experimenter’s
orientation (expectations and wishes) can affect the data actually 
obtained in his research. We are not so much concerned here with the 
problem of choice of experimental design or procedure and the fact that 
certain designs and procedures may unintentionally be more or less 
favorable to obtaining expected or unexpected data. Neither are we 
concerned with the problems of statistical tests of hypotheses and the 
fact that uniquely most powerful statistics may unintentionally be
employed when the expectation is to be able to reject the null hypothesis,
while less powerful statistics may be employed when the expectation is
to be unable to reject the null hypothesis. These are interesting
questions but will not be considered here. Our usage of “results”
or “outcome” will be restricted to the raw data obtained by experi- 
menters from their subjects. 

The effects of experimenter’s outcome orientation, or bias, were
seriously considered by Wilson (1952), the physical scientist. Wilson
felt that positive or expected data might too often occur because of
researchers’ interest in the outcome of their experiments. Their
expectancies about data might determine in part the data obtained.

[...] most related to Merton’s (1948) concept of
self-fulfilling prophecy. One prophesies an event and the expectation of
[...]es the prophet’s behavior in such a way as to make
[...] more likely. Related, too, is Heider’s (1958)
[...] causality” and his discussion of the fulfillment of
personal expectancies.


Outcome-orientation
effects in everyday life


The way a man golfs or bowls [...]
his performance. Of greater [...]
person expects another to perform [...]
part how he actually does perform. [...]
group of young men, Whyte (1943) found that the group, and especially
its leaders, “knew how well a man should bowl.” This “knowledge” or
[...]


Fascinating data collected at the Bank Street College of Education 
suggest that in the schoolroom as in the bowling lanes, expectancies 
may be powerful forces determining others’ behavior. Data described 
by John Niemeyer,2 President of the College, lend support to the
hypothesis that lower-class, minority-group children are low achievers,
at least partly because of their teachers’ expectation that these students
are really not educable.


Outcome-orientation effects 
in clinical practice


As highly skilled a clinician as Fromm-Reichmann (1950) was impressed 
by the effects of the self-fulfilling prophecy, although she did not use 
that term. She spoke rather, as other clinicians have, of iatrogenic 
psychiatric incurabilities. The therapist’s expectancy, she felt, might
determine whether given symptoms might be relieved or cured. This
clinical impression is somewhat supported by the work of Heine and
Trosman (1960) who felt that the variable significant for a patient's
continuance in psychotherapy was that of mutuality of expectation
between therapist and patient. Goldstein (1960) found no
client-perceived personality change due to psychotherapy related to therapist's
expectancy of such change. However, therapist's expectancy was
related to duration of psychotherapy. Additionally, Heller and
Goldstein (1961) found therapist’s expectation of client improvement
significantly correlated (.62) with change in client's attraction to
therapist. These workers also found that after 15 sessions, clients’
behavior was no more independent than before, but their
self-descriptions were of more independent behavior. The therapists generally
were favorable to increased independence and tended to [...]
successful cases to show this decrease in dependency. Clients may
have learned from their therapists that independent-sounding
verbalizations were desired and thereby served to fulfill their therapist’s
expectancy. The role of expectancy in the psychotherapeutic situation has
been most fully discussed and reviewed by Goldstein (1962).

But psychotherapy is not the only realm of clinical practice in
which expectancy effects may determine outcomes. The fatality rates
of delirium tremens have recently not exceeded about 15% [...]
From time to time new treatments of greatly varying sorts are [...]
reduce this figure almost to zero. Gunne’s (1958) work [...]
summarized by the Quarterly Journal of Studies on Alcohol Editorial
Staff (1959) showed that any change in therapy led to a drop in mortality
rate. One interpretation of this finding is that the [...]
treatment expects a decrease in mortality rate, and [...]
leads to subtle differential patient care over and above the

2. Personal communication, 1961. 
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treatment under investigation. A prophecy again may have been 
self-fulfilled. 

In the practice of medicine in general, the role of physician expect- 
ancy looms large. In a very comprehensive paper dealing with placebo 
effects, Shapiro (1960) cites the well-known admonition: “You should 
treat as many patients as possible with the new drugs while they still 
have the power to heal [p. 114].” The wisdom of this statement may 
have its basis in the concept of the physician’s faith in the power of the 
drug. This “faith” may have at its core expectancy as we are discussing 
it. The physician’s expectancy about the efficacy of a treatment may 
be subtly communicated to the patient with resulting influence on the 
patient’s psychobiological response. 


Outcome-orientation effects in survey research 


plied as expected. Hyman, 
1954) took vigorous exception to 
ascribe his remarkable results to 
. The plain fact, of course, is that 


information about the boys’ reliability, sociability, and stability, but
told not to regard these data in assessing the boys. Standardized
questions asked of the interviewers at the conclusion of the study
suggested that biases of assessment occurred even without
interviewers’ awareness and despite conscious resistance to bias. Harvey
felt that the interviewers’ bias evoked a certain attitude towards the 
boys which in turn determined the behavior to be expected and then the 
interpretation given. Note how neatly this formulation fits the model 
put forth by Merton. Again, we cannot be sure that subjects’ responses 
were actually altered by interviewer expectancies. The possibility, 
however, is too provocative to overlook. 

More recent evidence for an expectancy (outcome orientation)
bias comes from the work of Hanson and Marks (1958). The most
thorough discussion of this problem for the survey research literature is
that by Hyman et al. (1954), which also carries an extensive
bibliography.


Outcome-orientation effects in experimental research

It is well known that a great many studies have been conducted to
establish the validity or invalidity of the Rorschach technique of
personality assessment. A systematic study of 168 of these studies was
undertaken by Levy and Orr (1959) who categorized each study on each
of the following dimensions: the academic versus nonacademic
affiliation of the author, whether the study was designed to assess
construct versus criterion validity, and whether the outcome of the study
was favorable or unfavorable to the hypothesis of Rorschach validity.
Results showed that academicians were more interested in construct
validity and that their outcomes were relatively more favorable to
construct validation and less favorable to criterion validation. On the
basis of their findings, these workers called for more intensive study of
the researcher himself. “For, intentionally or not, he seems to exercise
greater control over human behavior than is generally thought [p. 83].”


It cannot be said whether the findings reported were a case of the
effect of outcome orientation or bias. It might have been that the
choice of specific hypotheses for testing, or that the choice of
procedures for testing them determined the apparently biased outcomes.
At the very least, however, this study accomplished its task of calling
attention to the potential biasing effects of experimenters.

Perhaps the earliest study which employed a straightforward
experimental task and directly manipulated an outcome-orientation
variable was that of Stanton and Baker (1942). In their study, 12
nonsense geometric figures were presented to a group of 200
undergraduate subjects. After several days, retention of these figures
was measured by five experienced workers. Experimenters were supplied
with a key of “correct” responses, some of which were actually correct
but some of which were incorrect. All experimenters were explicitly
warned to guard against any bias associated with their having the
keys before them and therefore influencing their subjects to guess
correctly. Results showed that the experimenter obtained outcomes in
accordance with his expectations. When the item on the key was
correct, the subject’s response was more likely to be correct than when
the key was incorrect. In a careful replication of this study, Lindzey
(1951) emphasized to his experimenters the importance of keeping the
keys out of the subjects’ view. This study failed to confirm the Stanton
and Baker findings. The 85 subjects of Lindzey’s study were much more
of a volunteer population than were the subjects of the original study.
We simply cannot say whether this fact might have accounted (in whole
or in part) for the difference. Another replication by Friedman (194 )
also failed to obtain the significance levels obtained in the original.
Still, significant results of this sort, even occurring only in one out of
three experiments, cannot be dismissed lightly. Stanton (1942, see
pp. 16-17) himself presented further evidence which strengthened his
conclusions. He employed a set of nonsense materials, 10 of which had
been presented to subjects and 10 of which had not. Experimenters
were divided into three groups. One group was correctly informed as
to which 10 materials had been exposed, another group was incorrectly
informed, while the third group was told nothing. The results of this
study also indicated that the materials which experimenters expected to
be more often chosen were, in fact, more often chosen.

An experiment analogous to those just described was conducted in
a psychophysical laboratory by workers (Warner and Raible, 1937)
who interpreted their study within the framework of parapsychological
phenomena. The study involved the judgment of weights by
subjects who could not see their experimenter. The latter kept his lips
tightly closed to prevent unconscious whispering (Kennedy, 1938).
In half the experimental trials, the experimenter knew the correct
response from a key. Of the 17 subjects, 6 showed a standard error of
1.0 or more from a 50-50 distribution of errors. All 6 of these subjects
made fewer errors on trials on which the experimenter knew which
weight was lighter or heavier. At least for those subjects who were
somewhat affected by the experimenter’s knowledge of the correct
response, the authors’ conclusion seems justified. As an alternative to
the interpretation of these results as extrasensory perception (ESP)
phenomena, they suggested the possibility of some form of auditory
cue transmission to the subjects.
Among the most recent studies in the area of ESP are those by
Schmeidler and McConnell (1958). These workers found that subjects
who believed ESP possible (“sheep”) performed better at ESP tasks
than did subjects who did not believe ESP possible (“goats”). These
workers suggested that an experimenter, by his presentation, might
affect subjects’ self-classification, thereby increasing or decreasing the
likelihood of successful ESP performance. Similarly, Anderson and
White (1958) found that teachers’ and students’ attitudes toward
each other might influence performance in classroom ESP experiments.
The mechanism operating here might also have been one of
certain teachers’ expectancies which were communicated to the
children whose self-classification as sheep or goats might thereby be
affected. The role of the experimenter in the results of ESP research
has been discussed by Crumbaugh (1959) as a source of evidence
against the existence of the phenomenon. We file no brief here for or
against ESP, but suggest that if, in carefully done experiments, certain
types of experimenters obtain certain types of ESP performances in a
predictable manner (as suggested by the studies cited), then further
evidence for the effects of experimenter outcome-orientation will have
been adduced (Rhine, 1959).


In a more traditional area of psychological research, that of memory,
Ebbinghaus (1913) called attention to similar experimenter effects.
In his own research he noted that his expectancy of what data he would
obtain affected the data he subsequently did obtain. He pointed out,
furthermore, that the experimenter’s knowledge of this expectancy was
not sufficient to control the phenomenon. This finding has been
unfortunately neglected by many subsequent researchers in the area.

Another possible case has been described by Stevens (1961). He
discussed the controversy between Fechner and Plateau over the
results of bisection experiments to determine the nature of the function
describing the operating characteristics of a sensory system. Plateau
held that it was a power rather than a log function. Delboeuf carried
out experiments for Plateau, but obtained data approximating the
Fechnerian prediction of a log function. Stevens puzzled over these
results which may be interpreted within the notion of experimenter
outcome-orientation. Either by implicitly expecting the Fechnerian
outcomes or by attempting to guard against an anti-Fechnerian bias,
Delboeuf may have influenced the outcome of his studies.

It would appear that Pavlov was aware of the possibility that
experimenter outcome-orientation might affect the results of experiments.
In an exchange of letters in Science, Zirkle (1958) and Razran (1959), in
discussing Pavlov’s attitude toward the notion of the inheritance of
acquired characteristics, gave credence to a statement by Gruenberg
(1929).

In an informal statement made at the time of the Thirteenth International
Physiological Congress, Boston, August, 1929, Pavlov explained that in
checking up these experiments, it was found that the apparent improvement
in the ability to learn, on the part of successive generations of mice,
was really due to an improvement in the ability to teach, on the part of
the experimenter! And so this “proof” of the transmission of modifications
drops out of the picture, at least for the present [p. 327].


Wherry³ has told of an experiment in which rats were able to
discriminate colors, but only when the experimenter was in the room.
Christie’s (1951) interpretation of some differences between Iowa
and Berkeley rats also suggests the possibility of experimenter effects
associated with his outcome orientation.

But perhaps the best-known and most instructive case illustrating 
the effects of outcome-orientation is that of Clever Hans (Pfungst, 1911). 
By means of tapping his hoof, the horse of von Osten was able to 
spell, read, and solve problems of arithmetic and musical harmony. 
Unlike the owners of other performing animals, Hans’ owner did not 
profit from his animal's talent, and permitted any serious investigator 
to test Hans even in von Osten’s absence. Pfungst, and his colleague 
Stumpf, undertook to discover the secret of Hans’ talents. 

A series of brilliant and painstaking experiments revealed that 
Hans’ questioners cued him unintentionally. A forward inclination of 
the questioner’s head served as signal to Hans to begin his hoof tapping. 
A slight upward motion of the questioner’s head or eyebrows served 
as signal for Hans to stop his tapping. Hans’ amazing talents, then, 
may be viewed as an illustration of the power of the self-fulfilling 
prophecy. Questioners, even skeptical ones, expected Hans to know 
the correct answers to their queries. Their expectation was reflected in 
their signal to Hans that they awaited the cessation of his tapping. This 
signal brought on the expected cessation and Hans was correct again. 

Pfungst aptly summarized the difficulties in uncovering the nature 
of Clever Hans’ talents by speaking of “looking for, in the horse, what 
should have been sought in the man.” 

Turning to a more recent example of possible outcome-orientation 
effects, we will describe an experiment dealing with the Freudian 
defense mechanism of projection (Rosenthal, 1956). A total of 108 
subjects was randomly divided into three groups each receiving 
success, failure, or neutral experience on a task structured as and 
simulating a standardized test of intelligence. Before the subjects’ 
experimental-treatment condition was imposed, they were asked to rate 
the degree of success or failure of persons pictured in photographs. 
Immediately after the experimental manipulation, the subjects were
asked to rate an equivalent set of photos on their degree of success
or failure. The dependent variable was the magnitude of the difference
scores from pre- to postratings of the photographs. It was hypothesized
that the success-treatment condition would lead to greater subsequent
perception of other people’s success, while the failure-treatment
condition would lead to greater subsequent perception of other
people’s failure as measured by the pre-post difference scores.

3. Personal communication, 1960.

An analysis (which was essentially unnecessary to the main purpose 
of the study) was performed which compared the mean preratings of 
the three experimental-treatment conditions. Preratings by subjects in 
the success-treatment group were significantly lower and less extreme 
than the prerating by subjects in the other conditions. In terms of 
the hypothesis under test, a lower prerating by this group would tend to 
lead to significantly different difference scores if the postratings 
were similar for all treatment conditions. Without the investigator's 
awareness, the cards had been stacked in favor of obtaining results 
confirming the hypothesis under test. It should be emphasized that the 
success and failure groups’ instructions had been verbally identical 
during the prerating phase of the experiment. 

The investigator, however, was aware for each subject which experi- 
mental treatment the subject would subsequently be administered. 


The implication is that in some subtle manner, perhaps by tone, or 
manner, or gestures, or general atmosphere, the experimenter, although 
formally treating the success and failure groups in an identical way, 
influenced the success S's to make lower initial ratings and thus increase 
the experimenter’s probability of verifying his hypothesis [Rosenthal, 
1956, p. 44]. 


Reports of the findings of the sort just presented are not numerous
and virtually never published. Nevertheless, their occurrence can be
documented.⁴ Allusions to the effects of the experimenter’s outcome
orientations in general have been made by Edwards (1950); Feldman
(1956); Foster (1923); Riecken (1962); Cohen, Silverman, Bressler,
and Shmavonian (1961); and half facetiously by Ammons and
Ammons (1957); and Rotter.⁶


4. O. Gardebring, 1962; J. Gengerelli, 1956; G. Mount, 1956; and G.
Rosenwald, 1963; personal communications.

5. A series of experiments specifically designed to investigate the occurrence 
and nature of the effects of the experimenter's outcome-orientation has 
recently been summarized elsewhere (Rosenthal, 1963). 


6. Personal communication, 1961. 
human use of human
subjects:


the problem of deception in social
psychological experiments¹


Herbert C. Kelman 


In 1954, in the pages of the American Psychologist, Edgar Vinacke
raised a series of questions about experiments, particularly in the
area of small groups, in which “the psychologist conceals the true
purpose and conditions of the experiment, or positively misinforms the
subjects, or exposes them to painful, embarrassing, or worse, experiences,
without the subjects’ knowledge of what is going on [p. 155].”
He summed up his concerns by asking, “What . . . is the proper
balance between the interests of science and the thoughtful treatment
of the persons who, innocently, supply the data? [p. 155]” Little
effort has been made in the intervening years to seek answers to the
questions he raised. During these same years, however, the problem of
deception in social psychological experiments has taken on increasingly
serious proportions.²


From Psychological Bulletin, Vol. 67 (No. 1), 1967, pp. 1-11. Copyright 1967
by the American Psychological Association. Reproduced by permission.

1. Paper read at the symposium on “Ethical and Methodological Problems in
Social Psychological Experiments,” held at the meetings of the American
Psychological Association in Chicago, September 3, 1965. This paper is a
product of a research program on social influence and behavior change
supported by United States Public Health Service Research Grant
MH-07280 from the National Institute of Mental Health.

The problem is actually broader, extending beyond the walls of 
the laboratory. It arises, for example, in various field studies in which 
investigators enroll as members of a group that has special interest for 
them so that they can observe its operations from the inside. The 
pervasiveness of the problem becomes even more apparent when we 
consider that deception is built into most of our measurement devices, 
since it is important to keep the respondent unaware of the personality 
or attitude dimension that we wish to explore. For the present 
purposes, however, primarily the problem of deception in the context of 
the social psychological experiment will be discussed. 

The use of deception has become more and more extensive, and it
is now a commonplace and almost standard feature of social
psychological experiments. Deception has been turned into a game, often
played with great skill and virtuosity. A considerable amount of the 
creativity and ingenuity of social psychologists is invested in the 
development of increasingly elaborate deception situations. Within a 
single experiment, deception may be built upon deception in a delicately 
complex structure. The literature now contains a fair number of 
studies in which second- or even third-order deception was employed. 

One well-known experiment (Festinger and Carlsmith, 1959), 
for example, involved a whole progression of deceptions. After the 
subjects had gone through an experimental task, the investigator made 
it clear—through word and gesture—that the experiment was over 
and that he would now “like to explain what this has been all about
so you'll have some idea of why you were doing this [p. 205].” This
explanation was false, however, and was designed to serve as a basis for 
the true experimental manipulation. The manipulation itself involved 
asking subjects to serve as the experimenter’s accomplices. The task 
of the “accomplice” was to tell the next “subject” that the experiment in 
which he had just participated (which was in fact a rather boring 
experience) had been interesting and enjoyable. He was also asked 
to be on call for unspecified future occasions on which his services as
accomplice might be needed because “the regular fellow couldn't
make it, and we had a subject scheduled [p. 205].” These newly
recruited “accomplices,” of course, were the true subjects, while the
“subjects” were the experimenter's true accomplices. For their
presumed services as accomplices, the true subjects were paid in advance,
half of them receiving $1, and half $20. When they completed their
service, however, the investigators added injury to insult by asking
them to return their hard-earned cash. Thus, in this one study, in
addition to receiving the usual misinformation about the purpose of
the experiment, the subject was given feedback that was really an
experimental manipulation, was asked to be an accomplice who was
really a subject, and was given a $20 bill that was really a
will-o’-the-wisp. One wonders how much further in this direction we can go.
Where will it all end?


2. In focusing on deception in social psychological experiments, I do not wish
to give the impression that there is no serious problem elsewhere. Deception
is widely used in most studies involving human subjects and gives rise to
issues similar to those discussed in this paper. Some examples of the use of
deception in other areas of psychological experimentation will be presented
later in this paper.

It is easy to view this problem with alarm, but it is much more
difficult to formulate an unambiguous position on the problem. As
a working experimental social psychologist, I cannot conceive the
issue in absolutist terms. I am too well aware of the fact that there are
good reasons for using deception in many experiments. There are
many significant problems that probably cannot be investigated
without the use of deception, at least not at the present level of
development of our experimental methodology. Thus, we are always
confronted with a conflict of values. If we regard the acquisition of
scientific knowledge about human behavior as a positive value, and if an
experiment using deception constitutes a significant contribution to such
knowledge which could not very well be achieved by other means,
then it is difficult to unequivocally rule out this experiment. The question
for us is not simply whether it does or does not use deception, but
whether the amount and type of deception are justified by the
significance of the study and the unavailability of alternative (that is,
deception-free) procedures.

I have expressed special concern about second-order deceptions,
for example, the procedure of letting a person believe that he is acting as
experimenter or as the experimenter’s accomplice when he is in fact
serving as the subject. Such a procedure undermines the relationship
between experimenter and subject even further than simple
misinformation about the purposes of the experiment; deception does not
merely take place within the experiment, but encompasses the whole
definition of the relationship between the parties involved. Deception
that takes place while the person is within the role of subject for which
he has contracted can, to some degree, be isolated, but deception about
the very nature of the contract itself is more likely to suffuse the
experimenter-subject relationship as a whole and to remove the
possibility of mutual trust. Thus, I would be inclined to take a more
absolutist stand with regard to such second-order deceptions—but 
even here the issue turns out to be more complicated. I am stopped 
short when I think, for example, of the ingenious studies on experimenter
bias by Rosenthal and his associates (e.g., Rosenthal and Fode, 1963; 
Rosenthal, Persinger, Vikan-Kline, and Fode, 1963; Rosenthal, 
Persinger, Vikan-Kline, and Mulry, 1963). These experiments em- 
ployed second-order deception in that subjects were led to believe 
that they were the experimenters. Since these were experiments about 
experiments, however, it is very hard to conceive of any alternative 
procedures that the investigators might have used. There is no question 
in my mind that these are significant studies; they provide funda- 
mental inputs to present efforts at reexamining the social psychology of 
the experiment. These studies, then, help to underline even further the 
point that we are confronted with a conflict of values that cannot be 
resolved by fiat. 

I hope it is clear from these remarks that my purpose in focusing 
on this problem is not to single out specific studies performed by some 
of my colleagues and to point a finger at them. Indeed, the finger points 
at me as well. I too have used deception, and have known the joys of 
applying my skills and ingenuity to the creation of elaborate experi- 
mental situations that the subjects would not be able to decode. I am 
now making active attempts to find alternatives to deception, but still 
I have not forsworn the use of deception under any and all circum- 
stances. The questions I am raising, then, are addressed to myself 
as well as to my colleagues. They are questions with which all of us 
who are committed to social psychology must come to grips, lest we 
leave their resolution to others who have no understanding of what we 
are trying to accomplish. 
What concerns me most is not so much that deception is used, but 
precisely that it is used without question. It has now become standard 
operating procedure in the social psychologist’s laboratory. I some- 
times feel that we are training a generation of students who do not 
know that there is any other way of doing experiments in our field— 
who feel that deception is as much de rigueur as significance at the .05 
level. Too often deception is used not as a last resort, but as a matter of
course. Our attitude seems to be that if you can deceive, why tell the
truth? It is this unquestioning acceptance, this routinization of
deception, that really concerns me.


I would like to turn now to a review of the bases for my concern
with the problem of deception, and then suggest some possible
approaches for dealing with it.
Implications of the use of deception
in social psychological experiments 


My concern about the use of deception is based on three considerations:
the ethical implications of such procedures, their methodological 
implications, and their implications for the future of social psychology. 


1. Ethical implications


Ethical problems of a rather obvious nature arise in the experiments in 
which deception has potentially harmful consequences for the subject. 
Take, for example, the brilliant experiment by Mulder and Stemerding 
(1963) on the effects of threat on attraction to the group and need for 
strong leadership. In this study—one of the very rare examples 
of an experiment conducted in a natural setting—independent food 
merchants in a number of Dutch towns were brought together for 
group meetings, in the course of which they were informed that a large 
organization was planning to open up a series of supermarkets in the
Netherlands. In the High Threat condition, subjects were told that
there was a high probability that their town would be selected as a
site for such markets, and that the advent of these markets would cause
a considerable drop in their business. On the advice of the executives of
the shopkeepers’ organizations, who had helped to arrange the group
meetings, the investigators did not reveal the experimental
manipulations to their subjects. I have been worried about these Dutch
merchants ever since I heard about this study for the first time. Did
some of them go out of business in anticipation of the heavy
competition? Do some of them have an anxiety reaction every time they see a
bulldozer? Chances are that they soon forgot about this threat
(unless, of course, supermarkets actually did move into town) and that
it became just one of the many little moments of anxiety that must occur
in every shopkeeper’s life. Do we have a right, however, to add to
life’s little anxieties and to risk the possibility of more extensive
anxiety purely for the purposes of our experiments, particularly since
deception deprives the subject of the opportunity to choose whether or
not he wishes to expose himself to the risks that might be entailed?

The studies by Bramel (1962, 1963) and Bergin (1962) provide 
examples of another type of potentially harmful effects arising from 
the use of deception. In the Bramel studies, male undergraduates were 
led to believe that they were homosexually aroused by photographs of 
men. In the Bergin study, subjects of both sexes were given discrepant 
information about their level of masculinity or femininity; in one 
experimental condition, this information was presumably based on an 
elaborate series of psychological tests in which the subjects had
participated. In all of these studies, the deception was explained to the 
subject at the end of the experiment. One wonders, however, whether 
such explanation removes the possibility of harmful effects. For many 
persons in this age group, sexual identity is still a live and sensitive 
issue, and the self-doubts generated by the laboratory experience may 
take on a life of their own and linger on for some time to come. 

Yet another illustration of potentially harmful effects of deception 
can be found in Milgram’s (1963, 1965) studies of obedience. In these 
experiments, the subject was led to believe that he was participating in 
a learning study and was instructed to administer increasingly severe 
shocks to another person who after a while began to protest vehemently. 
In fact, of course, the victim was an accomplice of the experimenter and 
did not receive any shocks. Depending on the conditions, sizable 
proportions of the subjects obeyed the experimenter’s instructions and 
continued to shock the other person up to the maximum level, which 
they believed to be extremely painful. Both obedient and defiant 
subjects exhibited a great deal of stress in this situation. The complexities
of the issues surrounding the use of deception become
quite apparent when one reads the exchange between Baumrind (1964) 
and Milgram (1964) about the ethical implications of the obedience 
research. There is clearly room for disagreement, among honorable 
people, about the evaluation of this research from an ethical point of 
view. Yet, there is good reason to believe that at least some of the 
obedient subjects came away from this experience with a lower self- 
esteem, having to live with the realization that they were willing to 
yield to destructive authority to the point of inflicting extreme pain on a
fellow human being. The fact that this may have provided, in Milgram’s 
(1964) words, “an opportunity to learn something of importance about 
themselves, and more generally, about the conditions of human 
action [p. 850]” is beside the point. If this were a lesson from life, it 
would indeed constitute an instructive confrontation and provide a 
valuable insight. But do we, for the purpose of experimentation, 
have the right to provide such potentially disturbing insights to subjects 
who do not know that this is what they are coming for? A similar 
question can be raised about the Asch (1951) experiments on group 
pressure, although the stressfulness of the situation and the implications 
for the person’s self-concept were less intense in that context. 

While the present paper is specifically focused on social psycho- 
logical experiments, the problem of deception and its possibly harmful 
effects arises in other areas of psychological experimentation as well. 
Dramatic illustrations are provided by two studies in which subjects 
were exposed, for experimental purposes, to extremely stressful 
conditions. In an experiment designed to study the establishment of a
conditioned response in a situation that is traumatic but not painful, 
Campbell, Sanderson, and Laverty (1964) induced—through the use 
of a drug—a temporary interruption of respiration in their subjects. 
“This has no permanently harmful physical consequences but is
nonetheless a severe stress which is not in itself painful . . . [p. 628].” 
The subjects’ reports confirmed that this was a “horrific” experience for 
them. “All the subjects in the standard series said that they thought 
they were dying [p. 631].” Of course the subjects, “male alcoholic 
patients who volunteered for the experiment when they were told that it 
was connected with a possible therapy for alcoholism [p. 629],” were 
not warned in advance about the effect of the drug, since this information
would have reduced the traumatic impact of the experience.³ In a
series of studies on the effects of psychological stress, Berkun, Bialek, 
Kern, and Yagi (1962) devised a number of ingenious experimental 
situations designed to convince the subject that his life was actually in 
danger. In one situation, the subjects, a group of Army recruits, were 
actually “passengers aboard an apparently stricken plane which was 
being forced to ‘ditch’ or crash-land [p. 4].” In another experiment, an
isolated subject in a desolate area learned that a sudden emergency had 
arisen (accidental nuclear radiation in the area, or a sudden forest fire, 
or misdirected artillery shells—depending on the experimental 
condition) and that he could be rescued only if he reported his position
over his radio transmitter, “which has quite suddenly failed [p. 7].”
In yet another situation, the subject was led to believe that he was 
responsible for an explosion that seriously injured another soldier. As 
the authors pointed out, reactions in these situations are more likely 
to approximate reactions to combat experiences or to naturally 
occurring disasters than are reactions to various laboratory stresses,
but is the experimenter justified in exposing his subjects to such extreme 
threats? 

So far, I have been speaking of experiments in which deception has
potentially harmful consequences. I am equally concerned, however,
about the less obvious cases, in which there is little danger of harmful
effects, at least in the conventional sense of the term. Serious ethical
issues are raised by deception per se and the kind of use of human


3. The authors reported, however, that some of their other subjects were 
physicians familiar with the drug; “they did not suppose they were dying 
but, even though they knew in a general way what to expect, they too said 
that the experience was extremely harrowing [p. 632].” Thus, conceivably, 
the purposes of the experiment might have been achieved if the subjects
had been told to expect the temporary interruption of breathing.

beings that it implies. In our other interhuman relationships, most of
us would never think of doing the kinds of things that we do to our 
subjects—exposing others to lies and tricks, deliberately misleading 
them about the purposes of the interaction or withholding pertinent 
information, making promises or giving assurances that we intend to 
disregard. We would view such behavior as a violation of the respect to 
which all fellow humans are entitled and of the whole basis of our 
relationship with them. Yet we seem to forget that the experimenter- 
subject relationship—whatever else it is—is a real interhuman relation- 
ship, in which we have responsibility toward the subject as another 
human being whose dignity we must preserve. The discontinuity 
between the experimenter’s behavior in everyday life and his behavior 
in the laboratory is so marked that one wonders why there has been so 
little concern with this problem, and what mechanisms have allowed 
us to ignore it to such an extent. I am reminded, in this connection, of 
the intriguing phenomenon of the “holiness of sin,” which characterizes 
certain messianic movements as well as other movements of the true- 
believer variety. Behavior that would normally be unacceptable 
actually takes on an aura of virtue in such movements through a 
redefinition of the situation in which the behavior takes place and thus 
of the context for evaluating it. A similar mechanism seems to be 
involved in our attitude toward the psychological experiment. We tend 
to regard it as a situation that is not quite real, that can be isolated 
from the rest of life like a play performed on stage, and to which, there- 
fore, the usual criteria for ethical interpersonal conduct become 
irrelevant. Behavior is judged entirely in the context of the experiment’s
scientific contribution and, in this context, deception—which is nor- 
mally unacceptable—can indeed be seen as a positive good. 

The broader ethical problem brought into play by the very use of 
deception becomes even more important when we view it in the light of 
present historical forces. We are living in an age of mass societies in 
which the transformation of man into an object to be manipulated at 
will occurs “on a mass scale, in a systematic way, and under the aegis 
of specialized institutions deliberately assigned to this task [Kelman, 
1965].” In institutionalizing the use of deception in psychological
experiments, we are, then, contributing to a historical trend that
threatens values most of us cherish. 


2. Methodological 
implications 


A second source of my concern about the use of deception is my 
increasing doubt about its adequacy as a methodology for social
psychology.

A basic assumption in the use of deception is that a subject’s
awareness of the conditions that we are trying to create and of the 
phenomena that we wish to study would affect his behavior in such a 
way that we could not draw valid conclusions from it. For example,
if we are interested in studying the effects of failure on conformity, we
must create a situation in which the subjects actually feel that they
have failed, and in which they can be kept unaware of our interest in 
observing conformity. In short, it is important to keep our subjects 
naive about the purposes of the experiment so that they can respond 
to the experimental inductions spontaneously. 
How long, however, will it be possible for us to find naive subjects? 
Among college students, it is already very difficult. They may not know 
the exact purpose of the particular experiment in which they are 
participating, but at least they know, typically, that it is not what the 
experimenter says it is. Orne (1962) pointed out that the use of deception
“on the part of psychologists is so widely known in the college popula-
tion that even if a psychologist is honest with the subject, more often
than not he will be distrusted.” As one subject pithily put it, “ ‘Psychol-
ogists always lie!’ ” Orne added that “This bit of paranoia has some
support in reality [pp. 778-779].” There are, of course, other sources
of human subjects that have not been tapped, and we could turn to 
them in our quest for naiveté. But even there it is only a matter of time.
As word about psychological experiments gets around in whatever
network we happen to be using, sophistication is bound to increase.
I wonder, therefore, whether there is any future in the use of deception.

If the subject in a deception experiment knows what the experi- 
menter is trying to conceal from him and what he is really after in the 
study, the value of the deception is obviously nullified. Generally, 
however, even the relatively sophisticated subject does not know the
exact purpose of the experiment; he only has suspicions, which may 
approximate the true purpose of the experiment to a greater or lesser 
degree. Whether or not he knows the true purpose of the experiment, 
he is likely to make an effort to figure out its purpose, since he does not 
believe what the experimenter tells him, and therefore he is likely to
operate in the situation in terms of his own hypothesis of what is
involved. This may, in line with Orne’s (1962) analysis, lead him to do
what he thinks the experimenter wants him to do. Conversely, if
he resents the experimenter’s attempt to deceive him, he may try to 
throw a monkey wrench into the works; I would not be surprised if 
this kind of Schweikian game among subjects became a fairly well- 
established part of the culture of sophisticated campuses. Whichever 
course the subject uses, however, he is operating in terms of his own
conception of the nature of the situation, rather than in terms of the
conception that the experimenter is trying to induce. In short, the
experimenter can no longer assume that the conditions that he is
trying to create are the ones that actually define the situation for the 
subject. Thus, the use of deception, while it is designed to give the 
experimenter control over the subject’s perceptions and motivations, 
may actually produce an unspecifiable mixture of intended and 
unintended stimuli that make it difficult to know just what the subject 
is responding to. 

The tendency for subjects to react to unintended cues—to features 
of the situation that are not part of the experimenter’s design—is 
by no means restricted to experiments that involve deception. This 
problem has concerned students of the interview situation for some 
time, and more recently it has been analyzed in detail in the writings 
and research of Riecken, Rosenthal, Orne, and Mills. Subjects enter 
the experiment with their own aims, including attainment of certain 
rewards, divination of the experimenter’s true purposes, and favorable
self-presentation (Riecken, 1962). They are therefore responsive to 
demand characteristics of the situation (Orne, 1962), to unintended 
communications of the experimenter’s expectations (Rosenthal, 1963), 
and to the role of the experimenter within the social system that 
experimenter and subject jointly constitute (Mills, 1962). In any 
experiment, then, the subject goes beyond the description of the situa- 
tion and the experimental manipulation introduced by the investigator, 
makes his own interpretation of the situation, and acts accordingly. 

For several reasons, however, the use of deception especially 
encourages the subject to dismiss the stated purposes of the experiment 
and to search for alternative interpretations of his own. First, the 
continued use of deception establishes the reputation of psychologists 
as people who cannot be believed. Thus, the desire “to penetrate the 
experimenter’s inscrutability and discover the rationale of the experi- 
ment [Riecken, 1962, p. 34]” becomes especially strong. Generally, 
these efforts are motivated by the subject’s desire to meet the expecta- 
tions of the experimenter and of the situation. They may also be moti- 
vated, however, as I have already mentioned, by a desire to outwit 
the experimenter and to beat him at his own game, in a spirit of 
genuine hostility or playful one-upmanship. Second, a situation 
involving the use of deception is inevitably highly ambiguous since a 
great deal of information relevant to understanding the structure of 
the situation must be withheld from the subject. Thus, the subject is 
especially motivated to try to figure things out and likely to develop 
idiosyncratic interpretations. Third, the use of deception, by its very 
nature, causes the experimenter to transmit contradictory messages 
to the subject. In his verbal instructions and explanations he says one 
thing about the purposes of the experiment; but in the experimental
situation that he has created, in the manipulations that he has
introduced, and probably in covert cues that he emits, he says another
thing. This again makes it imperative for the subject to seek his own
interpretation of the situation.

I would argue, then, that deception increases the subject's tendency
to operate in terms of his private definition of the situation, differing
(in random or systematic fashion) from the definition that the experi-
menter is trying to impose; moreover, it makes it more difficult to
evaluate or minimize the effects of this tendency. Whether or not I
am right in this judgement, it can, at the very least, be said that the use
of deception does not resolve or reduce the unintended effects of the
experiment as a social situation in which the subject pursues his private
aims. Since the assumptions that the subject is naive and that he sees
the situation as the experimenter wishes him to see it are unwarranted,
the use of deception no longer has any special obvious advantages over
other experimental approaches. I am not suggesting that there may not
be occasions when deception may still be the most effective procedure
to use from a methodological point of view. But since it raises at least
as many methodological problems as any other type of procedure
does, we have every reason to explore alternative approaches and to
extend our methodological inquiries to the question of the effects of
using deception.


3. Implications for the
future of social
psychology


My third concern about the use of deception is based on its long-run
implications for our discipline and combines both the ethical and
methodological considerations that I have already raised. There is
something disturbing about the idea of relying on massive deception
as the basis for developing a field of inquiry. Can one really build a
discipline on a foundation of such research?

From a long-range point of view, there is obviously something
self-defeating about the use of deception. As we continue to carry out
research of this kind, our potential subjects become more and more
sophisticated, and we become less and less able to meet the conditions
that our experimental procedures require. Moreover, as we continue
to carry out research of this kind, our potential subjects become
increasingly distrustful of us, and our future relations with them are
likely to be undermined. Thus, we are confronted with the anomalous
circumstance that the more research we do, the more difficult and
questionable it becomes.

The use of deception also involves a contradiction between our
experimental procedures and our long-range aims as scientists and 
teachers. In order to be able to carry out our experiments, we are 
concerned with maintaining the naiveté of the population from which 
we hope to draw our subjects. We are all familiar with the experi- 
menter’s anxious concern that the introductory course might cover the 
autokinetic phenomenon, need achievement, or the Asch situation 
before he has had a chance to complete his experimental runs. This 
perfectly understandable desire to keep procedures secret goes counter 
to the traditional desire of the scientist and teacher to inform and 
enlighten the public. To be sure, experimenters are interested only in 
temporary secrecy, but it is not inconceivable that at some time in the 
future they might be using certain procedures on a regular basis with 
large segments of the population and thus prefer to keep the public 
permanently naive. It is perhaps not too fanciful to imagine, for the 
long run, the possible emergence of a special class, in possession of 
secret knowledge—a possibility that is clearly antagonistic to the
principle of open communication to which we, as scientists and
intellectuals, are so fervently committed. 


Dealing with the problem of deception 
in social psychological experiments


If my concerns about the use of deception are justified, what are some of
the ways in which we, as experimental social psychologists, can deal
with them? I would like to suggest three steps that we can take:


1. Active awareness of 
the problem 


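I have already stressed that I would not propose the complete elimina-
tion of deception under all circumstances, in view of the genuine
conflict of values with which the experimenter is confronted. What is
crucial, however, is that we always ask ourselves the question whether
deception, in the given case, is necessary and justified. How we
answer the question is less important than the fact that we ask it.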
What we must be wary of is the tendency to dismiss the question as 
irrelevant and to accept deception as a matter of course. Active 
awareness of the problem is thus in itself part of the solution, for it
makes the use of deception a matter for discussion, deliberation,
investigation, and choice. Active awareness means that, in any given 
case, we will try to balance the value of an experiment that uses decep-
tion against its questionable or potentially harmful effects. If we engage
in this process honestly, we are likely to find that there are many occa- 
sions when we or our students can forego the use of deception—either 
because deception is not necessary (that is, alternative procedures that 
are equally good or better are available), because the importance of the 
study does not warrant the use of an ethically questionable procedure, 
or because the type of deception involved is too extreme (in terms of the 
possibility of harmful effects or of seriously undermining the experi-
menter-subject relationship).


2. Counteracting and minimizing
the negative effects of deception


If we do use deception, it is essential that we find ways of counteracting 
and minimizing its negative effects. Sensitizing the apprentice re- 
searcher to this necessity is at least as fundamental as any other part of 
research training.

In those experiments in which deception carries the potential of 
harmful effects (in the more usual sense of the term), there is an obvious 
requirement to build protections into every phase of the process. 
Subjects must be selected in a way that will exclude individuals who are 
especially vulnerable; the potentially harmful manipulation (such as 
the induction of stress) must be kept at a moderate level of intensity;
the experimenter must be sensitive to danger signals in the reactions
of his subjects and be prepared to deal with crises when they arise;
and, at the conclusion of the session, the experimenter must take time
not only to reassure the subject, but also to help him work through his
feelings about the experience to whatever degree may be required. 
In general, the principle that a subject ought not to leave the laboratory
with greater anxiety or lower self-esteem than he came with is a good
one to follow. I would go beyond it to argue that the subject should in
some positive way be enriched by the experience, that is, he should
come away from it with the feeling that he has learned something,
understood something, or grown in some way. This, of course, adds
special importance to the kind of feedback that is given to the subject
at the end of the experimental session.

Postexperimental feedback is, of course, the primary way of counter-
acting negative effects in those experiments in which the issue is
deception as such, rather than possible threats to the subject's well-
being. If we do deceive the subject, then it is our obligation to give

I feel very strongly that to accomplish these purposes, we must keep
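the feedback itself inviolate and under no circumstance give the subject
false feedback or pretend to be giving him feedback while we are in
fact introducing another experimental manipulation. If we hope to
maintain any kind of trust in our relationship with potential subjects,
there must be no ambiguity that the statement “The experiment is
over and I shall explain to you what it was all about” means precisely
that and nothing else. If subjects have reason to suspect even that
statement, then we have lost the whole possibility of a decent human
relationship with our subjects and all hope for future cooperation
from them.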


3. Development of new 
experimental techniques 


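My third and final suggestion is that we invest some of the creativity
and ingenuity, now devoted to the construction of elaborate deceptions,
in the search for alternative experimental techniques that do not rely
on the use of deception. Such techniques would be designed to involve
the subject as an active participant in a joint effort with the experimenter.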

Perhaps the most promising source of alternative experimental
approaches are procedures using some sort of role playing. I have
been impressed, for example, with the role playing that I have observed
in the context of the Inter-Nation Simulation (Guetzkow, Alger,
Brody, Noel, and Snyder, 1963), a laboratory procedure involving a
simulated world in which the subjects take the roles of decision-
makers of various nations. This situation seems to create a high level
of emotional involvement and to elicit motivations that have a real-life
quality to them. Moreover, within this situation—which is highly 
complex and generally permits only gross experimental manipulations 
—it is possible to test specific theoretical hypotheses by using data 
based on repeated measurements as interaction between the simulated 
nations develops. Thus, a study carried out at the Western Behavioral 
Sciences Institute provided, as an extra, some interesting opportunities 
for testing hypotheses derived from balance theory, by the use of mutual 
ratings made by decision-makers of Nations A, B, and C, before and 
after A shifted from an alliance with B to an alliance with C. 

A completely different type of role playing was used effectively by 
Rosenberg and Abelson (1960) in their studies of cognitive dilemmas. 
In my own research program, we have been exploring different kinds 
of role-playing procedures with varying degrees of success. In one 
study, the major manipulation consisted in informing subjects that the
experiment to which they had just committed themselves would require
them (depending on the condition) either to receive shocks from a 
fellow subject, or to administer shocks to a fellow subject. We used a 
regular deception procedure, but with a difference: We told the subjects
before the session started that what was to follow was make-believe, but 
that we wanted them to react as if they really found themselves in this 
situation. I might mention that some subjects, not surprisingly,
did not accept as true the information that this was all make-believe 
and wanted to know when they should show up for the shock experi- 
ment to which they had committed themselves. I have some question 
about the effectiveness of this particular procedure. It did not do 
enough to create a high level of involvement, and it turned out to be
very complex since it asked subjects to role-play subjects, not people. 
In this sense, it might have given us the worst of both worlds, but I 
still think it is worth some further exploration. In another experiment, 
we were interested in creating differently structured attitudes about
an organization by feeding different kinds of information to two
groups of subjects. These groups were then asked to take specific
actions in support of the organization, and we measured attitude
changes resulting from these actions. In the first part of the experiment,
the subjects were clearly informed that the organization and the infor- 
mation that we were feeding to them were fictitious, and that we were 
simply trying to simulate the conditions under which attitudes about
new organizations are typically formed. In the second part of the 
experiment, the subjects were told that we were interested in studying
the effects of action in support of an organization on attitudes toward it,
and they were asked (in groups of five) to role-play a strategy meeting 
of leaders of the fictitious organization. The results of this study were 
very encouraging. While there is obviously a great deal that we need to 
know about the meaning of this situation to the subjects, they did 
react differentially to the experimental manipulations and these 
reactions followed an orderly pattern, despite the fact that they knew it 
was all make-believe. 

There are other types of procedures, in addition to role playing, 
that are worth exploring. For example, one might design field experi- 
ments in which, with the full cooperation of the subjects, specific 
experimental variations are introduced. The advantages of dealing 
with motivations at a real-life level of intensity might well outweigh the 
disadvantages of subjects’ knowing the general purpose of the experi- 
ment. At the other extreme of ambitiousness, one might explore the 
effects of modifying standard experimental procedures slightly by 
informing the subject at the beginning of the experiment that he will not 
be receiving full information about what is going on, but asking him to 
suspend judgment until the experiment is over. 

Whatever alternative approach we try, there is no doubt that it will 
have its own problems and complexities. Procedures effective for 
some purposes may be quite ineffective for others, and it may well turn 
out that for certain kinds of problems there is no adequate substitute 
for the use of deception. But there are alternative procedures that, 
for many purposes, may be as effective or even more effective than 
procedures built on deception. These approaches often involve a 
radically different set of assumptions about the role of the subject in the 
experiment: They require us to use the subject’s motivation to co- 
operate rather than to bypass it; they may even call for increasing the 
sophistication of potential subjects, rather than maintaining their 
naiveté. My only plea is that we devote some of our energies to active 
exploration of these alternative approaches.

psychology. Related to these two disciplines is a design called the
extreme groups design. Investigations using extreme groups, often 
chosen on the basis of personality characteristics, appear to have 
increased in frequency. Feldt’s article provides some much needed and 
helpful guidelines in the use of such designs. Guidelines are provided 
which aid the investigator in (1) selecting what percentage of the avail- 
able population should be used to define the extremes, and (2) deciding 
whether a correlational coefficient should be computed or whether 
extreme group differences should be compared. Feldt compares the 
two approaches and concludes that under certain conditions the 
correlational approach is more powerful while under other conditions 
the extreme group approach is more powerful. 

An interesting and informative paper by Grice compares the 
within-subjects design with the between-subjects design. The general 
point which Grice makes is reflected by the title of his article. Essen- 
tially, he concludes that the kind of relationship one obtains between 
variables may be dependent upon the particular experimental design 
that is used. The second part of Grice’s paper is devoted to the question 
of whether two different measures of behavior can be considered as 
measures of the same theoretical variable. Both within-group and 
between-group correlations are discussed as they relate to the general 
problem. 

Selections 10 and 11 deal with the problem of functional relations 
and the accuracy of these relations when based upon individual 
versus group data. Sidman’s paper describes several problems 
associated with each technique and is critical of functional relation- 
ships based upon group-averaged data. His general conclusion is 
that the mean curve does not reflect the form of individual curves. 
The paper by Estes (Selection 11) examines the problem further and
comes to a somewhat different conclusion. Estes believes that curves 
based upon averaged data are valuable for the analysis of behavior. 
He concludes that the major problem has not been with averaged 
curves but with our interpretations of them. 

Article 12, entitled N = 1, focuses on the behavior of one individual,
and a case is made for the importance of N = 1 studies. A description 
is given of the conditions under which an N of 1 is appropriate and a 
case is made for the generalizability of data based on one individual.

the two disciplines of
scientific psychology¹


Lee J. Cronbach 


No man can be acquainted with all of psychology today, as our conven- 
tion program proves. The scene resembles that of a circus, but a 
circus grander and more bustling than any Barnum ever envisioned—a 
veritable week-long diet of excitement and pink lemonade. Three days 
of smartly paced performance are required just to display the new 
tricks the animal trainers have taught their charges. We admire the 
agile paper-readers swinging high above us in the theoretical blue, 
saved from disaster by only a few gossamer threads of fact, and we
gasp as one symposiast thrusts his head bravely between another's 
sharp toothed jaws. This 18-ring display of energies and talents gives 
plentiful evidence that psychology is going places. But whither?

In the simpler days of psychology, the presidential address 
provided a summing-up and a statement of destination. The President
called the roll of the branches of psychology—praising the growth
of some youngsters, tut-tutting patriarchally over the delinquent

From American Psychologist, Vol. 12, 1957, pp. 671-684. Copyright 1957 by 
the American Psychological Association. Reproduced by permission.


1. Address of the President at the Sixty-Fifth Annual Convention of the
American Psychological Association, New York, New York, September 2,
1957.

tendencies of others—and showed each to his proper place at the
family table. My own title is reminiscent of those grand surveys, but 
the last speaker who could securely bring the whole of psychology 
within one perspective was Dashiell, with his 1938 address on “Rap- 
prochements in Contemporary Psychology” [15]. My scope must be 
far more restricted. 

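I shall discuss the past and future place within psychology of two
historic streams of method, thought, and affiliation which run through
the last century of our science. One stream is experimental psychology;
the other, correlational psychology. Dashiell optimistically forecast a
confluence of these two streams, but that confluence is still in the
making. Psychology continues to this day to be limited by the dedica-
tion of its investigators to one or the other method of inquiry rather
than to scientific psychology as a whole.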


The separation 
of the disciplines 


of correlations presented by Nature,

While the experimenter is
interested only in the variation he himself creates, the correlator finds
his interest in the already existing variation between individuals,
social groups, and species. By “correlational psychology” I do not
refer to studies which rely on one statistical procedure. Factor
analysis is correlational, to be sure, but so is the study of Ford and
Beach [23] relating sexual behavior to differences along the phylo-
genetic scale and across the cultural spectrum.
The well-known virtue of the experimental method is that it brings
situational variables under tight control. It thus permits rigorous
tests of hypotheses and confident statements about causation. The
correlational method, for its part, can study what man has not learned
to control or can never hope to control. Nature has been experimenting
since the beginning of time, with a boldness and complexity far beyond
the resources of science. The correlator’s mission is to observe and
organize the data from Nature’s experiments. As a minimum outcome,
such correlations improve immediate decisions and guide experi-
mentation. At the best, a Newton, a Lyell, or a Darwin can align the
correlations into a substantial theory.
During our century of scientific psychology, the correlators have
marched under many flags. In perhaps the first modern discussion of
scientific method in psychology (1874), Wundt [54] showed how
“experimental psychology” and “ethnic psychology” (i.e., cross-
cultural correlations) supplement each other. In one of the most
recent (1953), Bindra and Scheier [4] speak of the interplay of “experi-
mental” and “psychometric” method. At the turn of the century, the
brand names were “experimental” and “genetic” psychology, although
experimenters were also beginning to contrast their “general psy-
chology” with the “individual psychology” of Stern and Binet.

In 1913, Yerkes made the fundamental point that all the correla-
tional psychologies are one. His name for this branch was “compara-
tive psychology.”

Although comparative psychology in its completeness necessarily deals
with the materials of the psychology of infant, child, adult, whether the
being be human or infra-human; of animal or plant [!]—of normal
and abnormal individuals; of social groups and of civilizations, there is
no reason why specialists in the use of the comparative method should
not be so distinguished, and, if it seems necessary, labelled [55].


Even in advocating research on animals [56], Yerkes is emphatic in
defining the goal as correlation across species. In France, la psychologie
comparée continues to include all of differential psychology; but in
America, as Beach [2] has lamented, comparative psychology degenerated
into the experimental psychology of the white rat and thereby lost
the power of the correlational discipline.

Except for the defection of animal psychologists, the correlational
psychologists have remained loosely federated. Developmental psy- 
chologists, personality psychologists, and differential psychologists 
have been well acquainted both personally and intellectually. They 
study the same courses, they draw on the same literature, they join the 
same divisions of APA. 

Experimental and correlational psychologists, however, grew 
far apart in their training and interests. It is now commonplace for a 
student to get his PhD in experimental psychology without graduate 
training in test theory or developmental psychology, and the student of 
correlational branches can avoid experimental psychology only a 
little less completely. The journals of one discipline have small 
influence on the journals of the other [14]. Boring even dares to say 
[5, p. 578] that there is a personality difference between the fields: 
the distinction being that correlational psychologists like people! 

Certainly the scientific values of psychologists are sharply divided. 
Thorndike [9, 44] recently asked American psychologists to rate
various historic personages by indicating, on a forced-choice question- 
naire, which have made the greatest contributions to psychology. 
A factor analysis of the ratings shows two distinct factors (Figure 1). 
One bipolar factor (irrelevant to our present discussion) ranges from 
verbal to quantitative psychologists. The other factor has at one 
pole the laboratory experimenters like Stevens, Dodge, and Ebbing- 
haus, and at the opposite pole those like Binet, May, and Goodenough 
who collect and correlate field data. A psychologist’s esteem for the 
experimenters is correlated -.80 (-1.00, corrected for attenuation)
with his esteem for scientists who use correlational methods. 
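
The parenthetical correction can be illustrated with a minimal sketch of Spearman's correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The reliabilities of .80 used here are an assumption chosen only for illustration (the text reports just the observed and corrected coefficients), and the function name is hypothetical.

```python
import math


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimate the correlation between true scores from an observed
    correlation and the reliabilities of the two measures."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Assumed reliabilities (not reported in the text): .80 for each esteem score.
corrected = correct_for_attenuation(-0.80, 0.80, 0.80)
print(round(corrected, 2))
```

With lower assumed reliabilities the corrected value would exceed 1.00 in absolute size, which is one reason corrected coefficients are usually read as rough estimates.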

There was no such schism in 1913 when Yerkes stated the program 
of correlational psychology. Genetic psychology and experimental
psychology were hard at work on the same problems. Terman 
demonstrated in his 1923 presidential address [43] that the mental 
test was within the tradition of experimental, fundamental research in 
psychology, and had quotations to show that the contemporary 
experimentalists agreed with him. Wells and Goddard, in 1913, had 
been asked to lecture on mental tests within the Holy Temple itself,
the Society of Experimental Psychologists. And, in 1910, the High
Priest Titchener had said:


Individual psychology is one of the chief witnesses to the value of experi- 
ment. It furnishes the key to many, otherwise inexplicable differences 
of result, and it promises to allay many of the outstanding controver- 
sies... . There can be no doubt that it will play a part of steadily 
increasing importance [46]. 
... stripping them of scientific importance. ... first firm
stride in the opposite direction:


I suggest that we dethrone the stimulus. He is only nominally the ruler
of psychology. The real ruler of the domain which psychology studies is
the individual and his motives, desires, wants, ambitions, cravings,
aspirations. The stimulus is merely the more or less accidental fact ...
[45, p. 364].

The personality, social, and child psychologists went one way; the
perception and learning psychologists went the other; and the country 
between turned into desert. 

During the estrangement of correlational and experimental 
psychology, antagonism has been notably absent. Disparagement has 
been pretty well confined to playful remarks like Cattell’s accusation 
that the experimental psychologist’s “regard for the body of nature 
becomes that of the anatomist rather than that of the lover” [7, p. 152], 
or the experimentalist Bartlett’s [1, p. 210] satire on the testers emerging 
from World War I, “chanting in unaccustomed harmony the words of 
the old jingle 


‘God has a plan for every man
And He has one for you.’”


Most correlationists have done a little experimenting in the narrow 
sense, and experimenters have contributed proudly to testing work 
under war-time necessity. But these are temporary sojourns in a 
foreign land. (For clear expressions of this attitude, see [5, pp. 570-578 
and 52, p. 24].) 

A true federation of the disciplines is required. Kept independent, 
they can give only wrong answers or no answers at all regarding 
certain important problems. It is shortsighted to argue for one science
to discover the general laws of mind or behavior and for a separate
enterprise concerned with individual minds, or for a one-way dependence of
personality theory upon learning theory. Consider the physical 
sciences as a parallel. Physics for centuries was the study of general 
laws applying to all solids or all gases, whereas alchemy and chemistry 
studied the properties and reactions of individual substances. Chem- 
istry was once only a descriptive catalogue of substances and analytic 
techniques. It became a systematic science when organized quantitative 
studies yielded principles to explain differences between substances and 
to predict the outcomes of reactions. In consequence, Mendeleev the 
chemist paved the way for Bohr the physicist, and Fermi’s physics 
contributes to Lawrence’s chemistry; the boundary between chemistry 
and physics has become almost invisible. 

The tide of separation in psychology has already turned. The 
perceiver has reappeared in perceptual psychology. Tested intelligence 
and anxiety appear as independent variables in many of the current 
learning experiments. Factor analytic studies have gained a fresh 
vitality from crossbreeding with classical learning experiments [e.g.,
18, 22]. Harlow, Hebb, Hess, and others are creating a truly experi- 
mental psychology of development. And students of personality have 


... theoretical sophistication. Discussions of the logic of operationism,
intervening variables, and mathematical models have sharpened both the
formulation of hypotheses and the interpretation of results. 

Individual differences have been an annoyance rather than a 
challenge to the experimenter. His goal is to control behavior, and 
variation within treatments is proof that he has not succeeded. Indi- 
vidual variation is cast into that outer darkness known as “error 
variance.” For reasons both statistical and philosophical, error 
variance is to be reduced by any possible device. You turn to animals 
of a cheap and short-lived species, so that you can use subjects with 
controlled heredity and controlled experience. You select human 
subjects from a narrow subculture. You decorticate your subject by 
cutting neurons or by giving him an environment so meaningless that 
his unique responses disappear [cf. 25]. You increase the number of 
cases to obtain stable averages, or you reduce N to 1, as Skinner does. 
But whatever your device, your goal in the experimental tradition is 
to get those embarrassing differential variables out of sight. 

The correlational psychologist is in love with just those variables 
the experimenter left home to forget. He regards individual and 
group variations as important effects of biological and social causes. 
All organisms adapt to their environments, but not equally well. 
His question is: what present characteristics of the organism determine 
its mode and degree of adaptation? 

Just as individual variation is a source of embarrassment to the 
experimenter, so treatment variation attenuates the results of the 
correlator. His goal is to predict variation within a treatment. His 
experimental designs demand uniform treatment for every case 
contributing to a correlation, and treatment variance means only error 
variance to him. 

Differential psychology, like experimental, began with a purely 
descriptive phase. Cattell at Hopkins, Galton at South Kensington, 
were simply asking how much people varied. They were, we might 
say, estimating the standard deviation while the general psychologists 
were estimating the central tendency.

The correlation coefficient, invented for the study of hereditary 
resemblance, transformed descriptive differential research into the 
study of mental organization. What began as a mere summary 
statistic quickly became the center of a whole theory of data analysis. 
Murphy's words, written in 1928, recall the excitement that attended 
this development: 


The relation between two variables has actually been found to be statable 
in other terms than those of experiment ... [Moreover,] Yule’s method
of “partial correlation” has made possible the mathematical “isolation”
of variables which cannot be isolated experimentally ... . [Despite the 
limitations of correlational methods,] what they have already yielded to 
psychology ... is nevertheless of such major importance as to lead the 
writer to the opinion that the only twentieth-century discovery comparable 
in importance to the conditioned-response method is the method of 
partial correlations [35, p. 410]. 

Today’s students who meet partial correlation only as a momentary 
digression from their main work in statistics may find this excitement 
hard to comprehend. But partial correlation is the starting place for 
all of factor analysis. 
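
For readers who have met partial correlation only in passing, the idea of mathematically "isolating" a variable can be sketched with the standard first-order formula: the correlation between x and y with z held constant is (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The function name and the sample coefficients below are illustrative assumptions, not values taken from Murphy or Yule.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y with z held constant."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical coefficients: x and y correlate .42, but both depend on z.
# Holding z constant, the x-y relation essentially vanishes.
print(round(partial_correlation(0.42, 0.60, 0.70), 3))
```

In this made-up case the whole of the observed x-y correlation is accounted for by the third variable, which is the kind of statistical "isolation" that cannot be arranged when z is not open to experimental control.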

Factor analysis is rapidly being perfected into a rigorous method of 
clarifying multivariate relationships. Fisher made the experimentalist 
an expert puppeteer, able to keep untangled the strands to half-a-dozen 
independent variables. The correlational psychologist is a mere 
observer of a play where Nature pulls a thousand strings; but his 
multivariate methods make him equally an expert, an expert in 
figuring out where to look for the hidden strings. 

His sophistication in data analysis has not been matched by 
sophistication in theory. The correlational psychologist was led into 
temptation by his own success, losing himself first in practical predic- 
tion, then in a narcissistic program of studying his tests as an end in 
themselves. A naive operationism enthroned theory of test performance 
in the place of theory of mental processes. And premature enthusiasm²
exalted a few measurements chosen almost by accident from the 
tester’s stock as the ruling forces of the mental universe. 

In former days, it was the experimentalist who wrote essay after 
anxious essay defining his discipline and differentiating it from 
competing ways of studying mind. No doubts plagued correlationists 
like Hall, Galton, and Cattell. They came in on the wave of evolu- 
tionary thought and were buoyed up by every successive crest of social 
progress or crisis. The demand for universal education, the
development of a technical society, the appeals from the distraught twentieth-
century parent, and finally the clinical movement assured the 
correlational psychologist of his great destiny. Contemporary experi- 
mentalists, however, voice with ever-increasing assurance their program 
and social function: and the fact that tonight you have a correlational 
psychologist discussing disciplinary identities implies that anxiety is
now perched on his windowledge. 


2. This judgement is not mine alone; it is the clear consensus of the factor
analysts themselves [see 28, pp. 321-325].

Indeed, I do speak out of concern for correlational psychology.
Aptitude tests deserve their fine reputation; but, if practical, validated 
procedures are to be our point of pride, we must be dissatisfied with 
our progress since 1920. As the Executive Committee of Division 5 
itself declared this year, none of our latter-day refinements or innova- 
tions has improved practical predictions by a noticeable amount. 
Correlational psychologists who found their self-esteem upon contribu- 
tions to theory can point to monumental investigations such as the 
Studies of Character and The Authoritarian Personality. Such work 
does throw strong light upon the human scene and brings important 
facts clearly into view. But theories to organize these facts are rarely 
offered and even more rarely solidified [30; 31, p. 55]. 


Potential contributions of 
the disciplines to one another 


Perhaps it is inevitable that a powerful new method will become 
totally absorbing and crowd other thoughts from the minds of its 
followers. It took a generation of concentrated effort to move from 
Spearman’s tetrad equation and Army Alpha to our present view of 
the ability domain. It took the full energies of other psychologists to 
move from S-R bonds to modern behavior theory. No doubt the 
tendency of correlationists to ignore experimental developments is 
explained by their absorption in the wonders and complexities of the 
phenomena their own work was revealing. And if experimentalists 
were to be accused of narrow-minded concentration on one particular 
style and topic of research, the same comment would apply. 

The spell these particular theories and methods cast upon us 
appears to have passed. We are free at last to look up from our own 
bedazzling treasure, to cast properly covetous glances upon the 
scientific wealth of our neighbor discipline. Trading has already 
been resumed, with benefit to both parties. 

The introduction of construct validation into test theory [12] is a
prime example. The history of this development, you may recall, was
that the APA’s Committee on Psychological Tests discovered that
available test theory recognized no way of determining whether a
proposed psychological interpretation of a test was sound. The only
existing theory dealt with criterion validation and could not evaluate
claims that a test measured certain psychological traits or states.
Meehl, capitalizing on the methodological and philosophical progress
of the experimenters, met the testers’ need by suggesting the idea of
construct validity. A proposed test interpretation, he showed, is a
claim that a test measures a construct, i.e., a claim that the test score
can be linked to a theoretical network. This network, together with
the claim, generates predictions about observations. The test interpre- 
tation is justified only if the observations come out as predicted. To 
decide how well a purported test of anxiety measures anxiety, construct 
validation is necessary; i.e., we must find out whether scores on the 
test behave in accordance with the theory that defines anxiety. This 
theory predicts differences in anxiety between certain groups, and 
traditional correlational methods can test those predictions. But the 
theory also predicts variation in anxiety, hence in the test score, as a 
function of experience or situations, and only an experimental 
approach can test those predictions. 

This new theory of validity has several very broad consequences. 
It gives the tester a start toward the philosophical sophistication the 
experimenter has found so illuminating. It establishes the experimental 
method as a proper and necessary means of validating tests. And it 
re-establishes research on tests as a valuable and even indispensable 
way of extending psychological theory. 

We may expect the test literature of the future to be far less saturated 
with correlations of tests with psychologically enigmatic criteria, and 
far richer in studies which define test variables by their responsiveness 
to practice at different ages, to drugs, to altered instructions, and to 
other experimentally manipulated variables. A pioneering venture in 
this direction is Fleishman’s revealing work [21, 22] on changes in 
the factorial content of motor skills as a function of practice. These 
studies go far beyond a mere exploration of certain tests; as Ferguson
has shown [19, 20], they force upon us a theory which treats abilities 
as a product of learning, and a theory of learning in which previously 
acquired abilities play a major role. 

Perhaps the most valuable trading goods the correlator can offer
in return is his multivariate conception of the world. 

No experimenter would deny that situations and responses are 
multifaceted, but rarely are his procedures designed for a systematic 
multivariate analysis. The typical experimental design and the typical 
experimental law employ a single dependent variable. Even when more 
than one outcome is measured, the outcomes are analyzed and inter- 
preted separately. No response measure, however, is an adequate 
measure of a psychological construct. Every score mixes general 
construct-relevant variance with variance specific to the particular 
measuring operation. It is all right for the agriculturist to consider 
size of crop as the fundamental variable being observed: that is the
payoff for him. Our task, however, is to study changes in fundamental
aspects of behavior, and these are evidenced only indirectly in any one
measure of outcome.




The correlational psychologist discovered long ago that no 
observed criterion is truly valid and that simultaneous consideration of 
many criteria is needed for a satisfactory evaluation of performance.
This same principle applies in experimentation. As Neal Miller says in
a recent paper on experiments with drugs:


Where there are relatively few facts it seems easy to account for them
by a few simple generalizations... . As we begin to study the effects of a
variety of drugs on a number of different behavioral measures, exceptions
and complexities emerge. We are forced to reexamine and perhaps
abandon common-sense categories of generalization according to
convenient words existing in the English language. As new and more
comprehensive patterns of results become available, however, new and
more precise generalizations may emerge. We may be able to “carve
nature better to the joint” and achieve the simplicity of a much more
exact and powerful science [32, pp. 326-327].


ystematic 


results from different tasks or die to classify 
remarks [20, p. 130; see also 19, p. 100]: “No satisfactory methodology
has emerged for describing particular learning tasks, or indicating 
how one task differs from another, other than by a process of simple 
inspection.” We depend wholly on the creative flair of the theorist to 
collate the experiments and to invent constructs which might describe 
particular situations, reinforcements, or injunctions in terms of more 
fundamental variables. The multivariate techniques of psychometrics 
are suited for precisely this task of grouping complex events into 
homogeneous classes or organizing them along major dimensions. 
These methods are frankly heuristic, but they are systematically 
heuristic. They select variables with minimal redundancy, and they 
permit us to obtain maximum information from a minimum of 
experimental investment.

In suggesting that examining treatment conditions as a statistical 
universe is a possible way to advance experimental thinking, I am of 
course echoing the recommendations of Egon Brunswik [6, esp. pp. 
39-58]. Brunswik criticized the Fisherian experimenter for his ad hoc 
selection of treatments and recommended that he apply the sampling 
principles of differential psychology in choosing stimuli and conditions. 
A sampling procedure such as Brunswik suggests will often be a 
forward step, but the important matter is not to establish laws which 
apply loosely to a random, unorganized collection of situations. The 
important matter is to discover the organization among the situations, 
so that we can describe situational differences as systematically as we 
do individual differences.

Research on stress presents a typical problem of organization. 
Multivariate psychophysiological data indicate that different taxing 
situations have different effects. At present, stressors can be described 
and classified only superficially, by inspection. A correlational or 
distance analysis of the data groups treatments which have similar 
effects and ultimately permits us to locate each treatment within a 
continuous multidimensional structure having constructs as reference 
axes. Data from a recent study by Wenger, Clemens, and Engel [50]
may be used as an illustration. Figure 2 shows the means of standard- 
ized physiological scores under four different stress conditions: 
mental arithmetic, a letter association test, hyperventilation, and a cold
pressor. The “profiles” for the four conditions are very significantly 
different. I have made a distance analysis to examine the similarity 
between conditions, with the results diagrammed in Figure 3. There 
is a general factor among all the treatments, which distinguishes them 
from the resting state, and a notable group factor among three of 
them. According to these data, a mental test seems to induce the same 
physiological state as plunging one’s foot into ice water! 
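
A distance analysis of this general kind can be sketched in a few lines. The profile values below are invented placeholders arranged only to mimic the pattern described in the text; they are not the Wenger, Clemens, and Engel data. A plain Euclidean distance between profiles is assumed as the similarity measure, and the function name is hypothetical.

```python
import math
from itertools import combinations


def profile_distances(profiles):
    """Return the Euclidean distance between every pair of treatment profiles.

    `profiles` maps a treatment name to its list of mean standard scores,
    with every list covering the same measures in the same order.
    """
    distances = {}
    for a, b in combinations(sorted(profiles), 2):
        distances[(a, b)] = math.dist(profiles[a], profiles[b])
    return distances


# Hypothetical mean standard scores on three physiological measures.
# These numbers are placeholders for illustration only.
profiles = {
    "mental arithmetic": [1.2, 0.8, 0.5],
    "letter association": [1.0, 0.9, 0.4],
    "hyperventilation": [0.3, 2.0, -0.6],
    "cold pressor": [1.1, 0.7, 0.6],
}

# List the pairs from most similar to least similar.
for pair, d in sorted(profile_distances(profiles).items(), key=lambda kv: kv[1]):
    print(f"{pair[0]} vs {pair[1]}: {d:.2f}")
```

Small distances mark treatments with similar effects; grouping treatments by such distances is the first step toward locating them on a few underlying dimensions.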

Fig. 2
Mean response to four stressors expressed in terms of resting
standard scores (data from [50]).


Fig. 3
Multivariate diagram showing similarity between four stressors.

Much larger bodies of data are of course needed to map the
treatment space properly. But the aptness of an attempt in this direction
will be apparent to all who heard Selye’s address to the APA last year. 
His argument [40] that all stressful situations lead to a similar syndrome 
of physiological changes is strongly reminiscent of Spearman’s argu- 
ment regarding a general factor linking intellectual responses. The 
disagreement between Selye and other students of stress clearly 
reduces to a quantitative question of the relative size of specific and 
nonspecific or general factors in the effects of typical stressors. 


Applied psychology
divided against itself

Let us leave for the moment questions of academic psychology and 
consider the schism as it appears in applied psychology. In applied 
psychology, the two disciplines are in active conflict; and unless they 
bring their efforts into harmony, they can hold each other to a standstill. 
The conflict is especially obvious at this moment in the challenge 
the young engineering psychology offers to traditional personnel
psychology.

The program of applied experimental psychology is to modify 
treatments so as to obtain the highest average performance when all 
persons are treated alike—a search, that is, for “the one best way.”
The program of applied correlational psychology is to raise average
performance by treating persons differently—different job
assignments, different therapies, different disciplinary methods. The
correlationist is utterly antagonistic to a doctrine of “the one best way,”
whether it be the heartless robot-making of Frederick Taylor or a 
doctrinaire permissiveness which tries to give identical encouragement
to every individual. The ideal of the engineering psychologist, I am
told, is to simplify jobs so that every individual in the working
population will be able to perform them satisfactorily, i.e., so that
differentiation of treatment will be unnecessary. This goal guides activities
ranging from the sober to the bizarre: from E. L. Thorndike and
Skinner, hunting the one best sequence of problems for teaching
arithmetic, to Rudolf Flesch and his admirers, reducing Paradise Lost
to a comic book. If the engineering psychologist succeeds: information
rates will be so reduced that the most laggard of us can keep up,
visual displays will be so enlarged that the most myopic can see them,
automatic feedback will prevent the most accident-prone from spoiling
the work or his fingers.

Obviously, with every inch of success the engineer has, the tester
must retreat a mile. A slight reduction in information rate,
accomplished once, reduces forever the validity and utility of a test of ability
to process data. If, once the job is modified, the myopic worker can 
perform as well as the man with 20/20 vision, Snellen charts and 
orthoraters are out of business. Nor is the threat confined to the 
industrial scene. If tranquilizers make everybody happy, why bother 
to diagnose patients to determine which treatments they should have?
And if televised lessons can simplify things so that every freshman will 
enjoy and understand quantum mechanics, we will need neither 
college aptitude tests nor final examinations. 

It is not my intention to warn testers about looming unemploy- 
ment. If test technology is not greatly improved, long before the 
applied experimentalists near their goals, testing deserves to disappear.
My message is my belief that the conflicting principles of the tester and
the experimenter can be fused into a new and integrated applied
psychology.


To understand the present conflict in purposes, we must look again
at historical antecedents. Pastore [36] argues with much justice that
the testers and classifiers have been political conservatives, while those
who try to find the best common treatment for all—particularly in
education—have been the liberals. This essential conservatism of
personnel psychology traces back to the days of Darwin and Spencer.

The theory of evolution inspired two antagonistic movements in
social thought [10, 42]. Darwin and Herbert Spencer were real
determinists. The survival of the fittest, as a law of Nature, guaranteed
man’s superiority and the ultimate triumph of the natural aristocrats
among men. As Dewey put it, Spencer saw “a rapid transit system of
evolution ... carrying us automatically to the goal of perfect man in
perfect society” [17, p. 66]. Men vary in their power of adaptation,
and institutions, by demanding adaptation, serve as instruments of
natural selection among men. The essence of freedom is seen as the
freedom to compete for survival. To Spencer, to Galton, and to their
successors down to the present day, the successful are those who have
the greatest adjustive capacity. The psychologist’s job, in this tradition,
is to facilitate or anticipate natural selection. He seeks only to reduce
its cruelty and wastage by predicting who will survive in schools and
other institutions as they are. He takes the system for granted and
tries to identify who will fit into it. His devices have a conservative
influence because they identify persons who will succeed in the existing
institution. By reducing failures, they remove a challenge which 
might otherwise force the institution to change [49]. 

The experimental scientist inherits an interpretation of evolution 
associated with the names of Ward, James, and Dewey. For them,
man’s progress rests on his intelligence; the great struggle for survival
is a struggle against environment, not against competitors. Intelligent
man must reshape his environment, not merely conform to it. This 
spirit, the very antithesis of Spencerian laissez-faire, bred today’s 
experimental social science which accepts no institution and no 
tradition as sacred. The individual is seen as inherently self-directing 
and creative. One can not hope to predict how he will meet his 
problems, and applied differential psychology is therefore pointless
[39, p. 37].

Thus we come to have one psychology which accepts the institution,
its treatment, and its criterion and finds men to fit the institution’s
needs. The other psychology takes man—generalized man—as given
and challenges any institution which does not conform to the measure of
this standard man.

A clearer view of evolution removes the paradox:


The entire significance of the evolutionary method in biology and social 
history is that every distinct organ, structure, or formation, every 
grouping of cells or elements, has to be treated as an instrument of 
adjustment or adaptation to a particular environing situation. Its 
meaning, its character, its value, is known when, and only when, it is
considered as an arrangement for meeting the conditions involved in
some specific situation [16, p. 15].

We are not on the right track when we conceive of adjustment or
adjustive capacity in the abstract. It is always a capacity to respond
to a particular treatment. The organism which adapts well under one
condition would not survive under another. If for each environment
there is a best organism, for every organism there is a best environment.
The job of applied psychology is to improve decisions about people.
The greatest social benefit will come from applied psychology if we can
find for each individual the treatment to which he can most easily
adapt. This calls for the joint application of experimental and
correlational methods.


Interaction of treatment and
individual in practical decisions

Goldine Gleser and the writer have recently published an
analysis [11] which shows that neither the traditional predictive
model of the correlator nor the traditional experimental comparison
of mean differences is an adequate formulation of the decisions
confronting the applied psychologist. Let me attempt to give a telescoped
version of the central argument.

The decision maker has to determine what treatment shall be
used for each individual or each group of individuals. Psychological 
data help a college, for example, select students to be trained as 
scientists. The aim of any decision maker is to maximize expected 
payoff. There is a payoff function relating outcome (e.g., achievement 
in science) to aptitude dimensions for any particular treatment. 
Figure 4 shows such a function for a single aptitude. Average payoff—if
everyone receives the treatment—is indicated by the arrow. The 
experimentalist assumes a fixed population and hunts for the treatment 
with the highest average and the least variability. The correlationist 
assumes a fixed treatment and hunts for aptitudes which maximize the 
slope of the payoff function. In academic selection, he advises admission 
of students with high scores on a relevant aptitude and thus raises 
payoff for the institution (Figure 5).

Pure selection, however, almost never occurs. The college aptitude 
test may seem to be intended for a selection decision; and, insofar as
the individual college is concerned only with those it accepts, the 
conventional validity coefficient does indicate the best test. But from a 
societal point of view, the rejects will also go on into other social 
institutions, and their profit from this treatment must be weighed in the 
balance along with the profit or social contribution from the ones who 
enter college. Every decision is really a choice between treatments. 
Predicting outcome has no social value unless the psychologist or the 
subject himself can use the information to make better choices of 
treatment. The prediction must help to determine a treatment for 
every individual. 
Even when there are just two treatments, the payoff functions have 
many possible relationships. In Figure 6 we have a mean difference 
between treatments, and a valid predictor. The predictor—though 
valid—is useless. We should give everyone Treatment A. In Figure 7, 
on the other hand, we should divide the group and give different 
treatments. This gives greater payoff than either treatment used 
uniformly will give. 

Assigning everyone to the treatment with the highest average, as 
the experimentalist tends to recommend, is rarely the best decision.
In Figure 8, Treatment C has the best average, and we might assign
everyone to it. The outcome is greater, however, if we assign some
persons to each treatment. The psychologist making an experimental
comparison arrives at the wrong conclusion if he ignores the aptitude 
variable and recommends C as a standard treatment. 
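The argument of Figures 6 through 8 can be illustrated with a small numerical sketch. The two linear payoff functions and the aptitude scores below are hypothetical values chosen only for illustration; they are not taken from the figures. The sketch compares the average payoff when everyone receives one treatment with the average payoff when each person receives whichever treatment pays more at his aptitude level.

```python
# Hypothetical linear payoff functions (illustrative numbers, not from the text).
payoff_fns = {
    "A": lambda x: 50 + 0.2 * x,  # shallow slope: better for low aptitude
    "B": lambda x: 30 + 0.8 * x,  # steep slope: better for high aptitude
}
aptitudes = list(range(0, 101, 5))  # assumed evenly spread aptitude scores


def mean_payoff(assign):
    """Average payoff when assign(x) names the treatment given at aptitude x."""
    return sum(payoff_fns[assign(x)](x) for x in aptitudes) / len(aptitudes)


def best_treatment(x):
    """Treatment with the higher payoff for a person of aptitude x."""
    return max(payoff_fns, key=lambda name: payoff_fns[name](x))


# Uniform assignment: everyone gets the same treatment.
uniform = {name: mean_payoff(lambda x, n=name: n) for name in payoff_fns}
best_uniform = max(uniform.values())

# Differential assignment: divide the group at the crossover point.
differential = mean_payoff(best_treatment)
```

With these assumed functions, Treatment B has the higher average, yet dividing the group yields a still higher average payoff, because persons below the crossover point do better under Treatment A.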


Applied psychologists should deal with treatments and persons
simultaneously. Treatments are characterized by many dimensions;
so are persons. The two sets of dimensions together determine a payoff

[Fig. 4. Scatter diagram and payoff function showing outcome as a
function of individual differences.]

[Fig. 7. Payoff functions for two treatments.]

[Figure: Payoff functions for three treatments.]

tends to have little interaction with treatment, and if so is not the best
guide to differential treatment. We require a measure of aptitude
that predicts who will learn better from one curriculum than from
the other; but this aptitude remains to be discovered. Ultimately we
should design treatments, not to fit the average person, but to fit groups
of students with particular aptitude patterns. Conversely, we should
seek out the aptitudes which correspond to (interact with) modifiable
aspects of the treatment.

My argument rests on the assumption that such aptitude-treatment
interactions exist. There is, scattered in the literature, a remarkable
amount of evidence of significant, predictable differences in the way
people learn. We have only limited success in predicting which of two
tasks a person can perform better, when we allow enough training to
compensate for differences in past attainment. But we do find that a
person learns more easily from one method than another, that this
best method differs from person to person, and that such between-
treatments differences are correlated with tests of ability and
personality. The studies showing interaction between personality and
conditions of learning have burgeoned in the past few years, and the
literature is much too voluminous to review in passing. Just one
recent finding will serve in the way of specific illustration, a study done
by Wolfgang Böhm at Vienna [38, pp. 58-59]. He showed his
experimental groups a sound film about the adventures of a small boy and
his toy elephant at the zoo. At each age level, a matched control group
read a verbatim text of the sound track. The differences in average
comprehension between the audiovisual and the text presentations were
trivial. There was, however, a marked interaction. For some reason
yet unexplained, a general mental test correlated only .30 with text
learning, but it predicted film learning with an average correlation of
.77.³ The difference was consistent at all ages.

Such findings as this, when replicated and explained, will carry us
into an educational psychology which measures readiness for different
types of teaching and which invents teaching methods to fit different
types of readiness. In general, unless one treatment is clearly best for
everyone, treatments should be differentiated in such a way as to
maximize their interaction with aptitude variables. Conversely,
persons should be allocated on the basis of those aptitudes which have
the greatest interaction with treatment variables. I believe we will
find these aptitudes to be quite unlike our present aptitude measures
chosen to predict differences within highly correlated treatments.


3. Personal communication.


The shape of a
united discipline


It is not enough for each discipline to borrow from the other.
Correlational psychology studies only variance among organisms;
experimental psychology studies only variance among treatments. A united
discipline will study both of these, but it will also be concerned with the
otherwise neglected interactions between organismic and treatment
variables [41]. Our job is to invent constructs and to form a network of
laws which permits prediction. From observations we must infer a
psychological description of the situation and of the present state of the
organism. Our laws should permit us to predict, from this description,
the behavior of organism-in-situation.

There was a time when experimental psychologists concerned
themselves wholly with general, nonindividual constructs, and
correlational psychologists sought laws wholly within developmental
variables. More and more, nowadays, their investigations are coming to
bear on the same targets. One psychologist measures ego involvement by
a personality test and compares the behavior of high- and low-scoring
subjects. Another psychologist heightens ego involvement
experimentally in one of two equated groups and studies the consequent
differences in behavior. Both investigators can test the same theoretical
propositions, and to the extent that their results agree they may
regard both procedures as embodiments of the same construct.

Constructs originating in differential psychology are now being
tied to experimental variables. As a result, the whole theoretical
picture in such an area as human abilities is changing. Piaget [3?]
correlates reasoning processes with age and discovers a developmental
sequence of schemata whose emergence permits operational
thought; Harlow [24] begins actually to create similar schemata in
monkeys by means of suitable training. It now becomes possible to
pursue in the controllable monkey environment the questions raised
by Piaget's unique combination of behavioral testing and interviewing,
and ultimately to unite the psychology of intelligence with the
psychology of learning.

Methodologies for a [...] some of them, have likewise seen the
necessity for a united discipline. In the very issue of Psychological
Review where the much-too-famous distinction between S-R and R-R laws
was introduced, Bergmann and Spence [3] declared that (at the present
stage of psychological knowledge) the equation R = f(S) must be
expanded into


R = f(S, T, D, I)


The added variables are innate differences, motivation, and past
experience: differential variables all. Hull [26, 27] sought general
laws just as did Wundt, but he added that organismic factors can and
must be accounted for. He proposed to do this by changing the
constants of his equations with each individual. This is a bold plan,
but one which has not yet been implemented in even a limited way.
It is of interest that both Hull [27, p. 116] and Tolman [47, p. 26] have
stated specifically that for their purposes factor analytic methods seem
to have little promise. Tucker, though, has at least drawn blueprints
of a method for deriving Hull's own individual parameters by factor
analysis [48]. Clearly, we have much to learn about the most suitable
way to develop a united theory, but we have no lack of exciting
possibilities.

The experimenter tends to keep his eye on ultimate theory.
Woodworth once described psychological laws in terms of the S-O-R
formula which specifically recognizes the individual. The revised
version of his Experimental Psychology [53, p. 3], however, advocates
an S-A-R formula, where A stands for "antecedent conditions."
This formulation, which is generally congenial to experimenters,
reduces the present state of the organism to an intervening variable
(Figure 9). A theory of this type is in principle entirely adequate to
explain, predict, and control the behavior of organisms; but, oddly
enough, it is a theory which can account only for the behavior of
organisms of the next generation, who have not yet been conceived.

[Fig. 9. Theoretical model for prediction from historic data. Diagram
labels: organism at present; present situation; predicted response.]

[...] to a different type of law (Figure 10) whenever [...] whose life
history he has not controlled [...] observed in every detail. A theory
which involves only laws of this type [...] prediction.

[Figure: Theoretical network to be developed by a united discipline.]

[...] characteristics of the organism, or a combination of the
two, depending on what is known. Filling in such a network is clearly
a task for the joint efforts of experimental and correlational psychology.

In both applied work and general scientific work, psychology
requires combined, not parallel, labors from our two historic disciplines.
In this common labor, they will almost certainly become one, with a
common theory, a common method, and common recommendations
for social betterment. In the search for interactions we will invent new
treatment dimensions and discover new dimensions of the organism.
We will come to realize that organism and treatment are an inseparable
pair and that no psychologist can dismiss one or the other as error
variance.

Despite our specializations, every scientific psychologist must
take the same scene into his field of vision. Clark Hull, three sentences
before the end of his Essentials of Behavior [27, p. 116], voiced just this
need. Because of delay in developing methodology, he said, individual
differences have played little part in behavior theory, and "a sizeable
segment of behavioral science remains practically untouched." This
untouched segment contains the question we really want to put to
nature, and she will never answer until our two disciplines ask it in
a single voice.


8


the use of extreme groups 
to test for the 
presence of a relationship 


Leonard S. Feldt 


In the exploratory stages of the investigation of a psychological 
construct, a two-stage experimental design is frequently employed. In 
the first stage, a random sample of subjects from a hypothetical or real 
population is evaluated via a crude measure of the construct, and 
from the distribution of scores which results, an arbitrary definition is 
derived for “High” and “Low” subgroups. In the second stage, the 
high and low groups are exposed to one or several treatment conditions. 
It is hypothesized that if the initial classification was even moderately 
valid, the treatment should produce a different distribution of
treatment criterion scores for the high and low subpopulations. Usually
such a difference is assessed through a comparison of the means of the
groups.

Examples of this design are quite common in the psychological
literature. It was frequently used, for example, in the early studies of
McClelland's Achievement Need [7]. It was also extensively employed
in the preliminary validation of Taylor's Manifest Anxiety Scale [11].
In the latter studies the second stage often involved fairly complex
and time-consuming treatment conditions, such as eyelid conditioning.
Thus, as is often the case, the use of extreme groups was prompted in
part by the necessity to limit the total number of subjects.

An important consideration in such a design is the choice of upper 
and lower percentiles which define the extreme groups. Current 
experimental practice evidences considerable variation in this aspect
of the design. Some investigators have employed extreme tenths or
fifths, others have utilized upper and lower halves. In all cases the 
decision appears to have been quite arbitrary, or dictated by necessity. 
The primary purpose of this paper is to derive a definition of optimal 
extreme groups for investigations of this kind. A second purpose is to 
compare the efficiency of this design to the correlation approach 
which provides an obvious alternative procedure. 


Definitions and assumptions 


In the following development, the initial classification variable, the
validity of which is under test, will be designated as $X$. The criterion
measure taken in the second stage of the experiment (strength of
eyelid conditioning in the Taylor Scale example) will be designated $Y$.
Measures $X$ and $Y$ will be assumed to give rise to a normal bivariate
surface with correlation $\rho_{XY}$ in the population of potential experimental
subjects. The experiment itself consists of obtaining measure $X$ on a
random sample from the subject population, defining equal extreme
subgroups on $X$, imposing treatment conditions and obtaining the
$Y$ criterion score on each subject, and finally testing the significance
of the difference between $Y$ means for the two groups via a t test.

In the definition of optimal extreme groups, the criterion employed
will be the power of the final t test. For a given level of significance the
power of this test is dependent upon three quantities: (i) the variability
of the $Y$ measures within each extreme group; (ii) the magnitude of the
true difference between the $Y$ means; and (iii) the size of the extreme
groups. Each of these factors is functionally related to the percentiles
chosen to define the groups. It is the nature of linear regression that the
more extreme [...] the larger the difference between the [...] upper and
lower halves are used. The value of the variance of $Y$ scores within the
subpopulations will also vary with the "extremeness" of the
subpopulations. The problem thus becomes one of deriving symmetrical
upper and lower percentiles which will result in a combination of true
difference, group size, and within-group variance that will yield the
most powerful test.


Optimal definition 

of extreme groups 

For the t test of the difference between means of extreme samples let
$N$ represent the total number of subjects initially classified, $p$ the
proportion of subjects in each extreme group, $n = pN$ the number of
subjects in each group, $S^2$ the sample variance, and the subscripts
$U$ and $L$ the upper and lower groups respectively. It should be noted
that in this situation $N$ is fixed; the problem involves the determination
of the optimal value of $p$. The t test for equal groups may be written
as follows:

$$t = \frac{\bar{Y}_U - \bar{Y}_L}{\sqrt{(S_U^2 + S_L^2)/(n - 1)}}$$

The power of this test is governed by the parameter $\phi$, defined as
follows [10]:

$$\phi = \frac{\mu_w}{\sqrt{2}\,\sigma_w} \tag{1}$$

In this formula $\mu_w$ is the expected value or mean of a normally
distributed variable, say $w$, and $\sigma_w$ is its population standard deviation.
In the present context $(\bar{Y}_U - \bar{Y}_L)$ is the normally distributed variable,
$(\mu_{Y_U} - \mu_{Y_L})$ is its expected value, and $\sigma_{\bar{Y}_U - \bar{Y}_L}$ is its population standard
deviation. Thus

$$\phi = \frac{|\mu_{Y_U} - \mu_{Y_L}|}{\sqrt{2}\,\sigma_{\bar{Y}_U - \bar{Y}_L}} = \frac{|\mu_{Y_U} - \mu_{Y_L}|}{\sqrt{2}\,\sqrt{2\sigma^2_{Y_U}/pN}} \tag{2}$$
; a : ae 
The value of $\sigma^2_{Y_U}$, or $\sigma^2_{Y_L}$, is given in [2] as $\sigma^2_Y(1 - c\rho^2)$. In this
relationship, $\sigma^2_Y$ is the variance of the total population on the treatment
criterion, $\rho$ is the population value of the linear correlation between
the classification variable and the treatment criterion, and $c$ is a
constant dictated by the degree of "extremeness" of the upper and lower
groups. The defining relationship for $c$ is

$$c = 1 - \frac{\sigma^2_{X_U}}{\sigma^2_X} \tag{3}$$

The second term on the right-hand side is the ratio of the variance
within either subpopulation (they are equally variable) to the variance
for the total population on the classification variable.

For the normal surface here assumed, the value of the ratio may be
computed, by integration by parts, for any selected segment of the
distribution [4]. Substitution of this result in (3) yields

$$c = \frac{z^2}{p^2} - \frac{xz}{p} \tag{4}$$


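where $x$ is the positive or negative standard normal deviate defining
the extreme groups, and $z$ is the ordinate of the standard normal
curve at $x$. It may also be noted that $\mu_{Y_U}$ and $\mu_{Y_L}$ are defined by the
regression line of $Y$ on $X$ as follows:

$$\mu_{Y_U} = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(\mu_{X_U} - \mu_X);$$

$$\mu_{Y_L} = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(\mu_{X_L} - \mu_X).$$

The difference thus equals

$$\mu_{Y_U} - \mu_{Y_L} = \rho\,\frac{\sigma_Y}{\sigma_X}\,(\mu_{X_U} - \mu_{X_L}). \tag{5}$$

By appropriate integration, the values of $\mu_{X_U}$ and $\mu_{X_L}$ may be derived
for any segment of a normal curve [4]. In the notation previously
adopted, $\mu_{X_U} = \mu_X + z\sigma_X/p$ and $\mu_{X_L} = \mu_X - z\sigma_X/p$. Substituting
these expressions into the previous equation, the difference becomes

$$\mu_{Y_U} - \mu_{Y_L} = \frac{2\rho z\sigma_Y}{p} \tag{6}$$

Substitution of results (4) and (6) into (2) and the division of
numerator and denominator by $2\sigma_Y$ yields the following expression
for $\phi$:

$$\phi = \frac{\rho z\sqrt{N}}{\sqrt{p\,[1 - \rho^2(z^2/p^2 - xz/p)]}} \tag{7}$$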
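The tail quantities that enter equations (3) through (7) can be checked numerically. The following is a minimal sketch, assuming a standard normal classification variable ($\sigma_X = 1$) and upper and lower quarters ($p = .25$); the function names are hypothetical. It compares the closed form for $c$ in equation (4) with the value obtained by integrating the truncated normal density directly, and it compares the mean of the upper group with $z/p$.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal curve


def extremeness_constant(p):
    """c = z^2/p^2 - x*z/p (equation 4) for extreme groups of proportion p."""
    x = STD.inv_cdf(1.0 - p)  # deviate cutting off the upper group
    z = STD.pdf(x)            # ordinate of the normal curve at x
    return z * z / (p * p) - x * z / p


def upper_tail_moments(p, steps=20_000, upper=10.0):
    """Mean and variance of X (sigma_X = 1) within the upper group,
    by midpoint-rule integration of the truncated normal density."""
    x0 = STD.inv_cdf(1.0 - p)
    h = (upper - x0) / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        t = x0 + (i + 0.5) * h
        w = STD.pdf(t) * h
        m0 += w
        m1 += w * t
        m2 += w * t * t
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean


p = 0.25
c_formula = extremeness_constant(p)
tail_mean, tail_var = upper_tail_moments(p)
c_numeric = 1.0 - tail_var  # equation (3) with sigma_X = 1
z_over_p = STD.pdf(STD.inv_cdf(1.0 - p)) / p
```

For quarters the two routes to $c$ should agree closely, and the mean of the upper group should match $z/p$, which is the quantity that produces the factor $2z/p$ in equation (6).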


To secure maximum power in the test of the difference between
$\bar{Y}_U$ and $\bar{Y}_L$, $\phi$ must be maximized. This is accomplished by appropriate
choice of $x$, and of $z$ and $p$, which are both functionally related to $x$.
Since the quantity $N$ is fixed in the context of this problem, its value
may be ignored. The condition for a maximum is obtained by setting
the first derivative of $\phi$ with respect to $x$ equal to zero. The process of
taking this derivative and of solving the resultant equation involves
some rather tedious algebra, and hence it will not be presented here.
The end result, however, may be stated quite simply. The maximum
value of $\phi$ occurs when

$$\left(1 + x^2 - \frac{z^2}{p^2}\right)\rho^2 + \frac{2xp}{z} = 1.00. \tag{8}$$

The nature of this relationship makes it difficult to solve analytically
for $x$ and $p$, given a specific value of $\rho$. However, the comprehensive
normal curve tables of Kelley [5] permit a sufficiently exact
approximation for $p$ for select values of $\rho$. A number of such pairs of values
are tabulated in Table 1.


Table 1
Extreme groups which result in mean
difference tests of maximum power

$\rho$      Percent in Each Extreme Group
.10         27.0
.20         26.9
.30         26.6
.40         26.3
.50         25.8
.60         25.1
.70         24.2
.80         23.3
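The maximization described above can also be carried out by direct numerical search rather than from tables. The sketch below is an illustration under stated assumptions, not the procedure used in the paper: it maximizes $\phi^2/N$ from equation (7) over $p$ by a golden-section search, assuming the function has a single maximum between $p = .05$ and $p = .50$. The function names and search bounds are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal curve


def phi_squared_per_subject(p, rho):
    """phi^2 / N from equation (7): rho^2 z^2 / (p [1 - rho^2 c])."""
    x = STD.inv_cdf(1.0 - p)
    z = STD.pdf(x)
    c = z * z / (p * p) - x * z / p
    return rho * rho * z * z / (p * (1.0 - rho * rho * c))


def optimal_proportion(rho, lo=0.05, hi=0.50, tol=1e-5):
    """Golden-section search for the p that maximizes phi.
    Assumes a single maximum on [lo, hi]."""
    g = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c1 = b - g * (b - a)
        c2 = a + g * (b - a)
        if phi_squared_per_subject(c1, rho) < phi_squared_per_subject(c2, rho):
            a = c1  # maximum lies to the right of c1
        else:
            b = c2  # maximum lies to the left of c2
    return (a + b) / 2


optimal_percent = {
    rho: 100 * optimal_proportion(rho) for rho in (0.10, 0.30, 0.50, 0.70)
}
```

The percentages obtained this way should fall close to those tabulated in Table 1. Because the function is very flat near its maximum, small departures from the tabulated values cost almost nothing in power.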

These values indicate that the definition of the optimal extreme
groups remains remarkably constant over a rather wide range of
values for $\rho$. For $\rho = .10$, the function reaches a maximum at $p = .27$;
for $\rho = .80$, the maximum occurs at $p = .23$. This finding has
important implications for researchers employing the difference approach
for testing relationships. It suggests that extreme groups of from
25 to 27 percent provide the most powerful test of the existence of a
moderate linear relationship. This size group is especially appropriate
when the relationship is weak, as it usually is in the experimental
context here considered.

From a practical point of view, it is extremely fortunate that this
maximum power is reached when groups smaller than the upper and
lower halves are employed. Frequently the nature of the experimental
treatment or the need to share the pool of experimental subjects forces
the investigator to use only a fraction of those originally tested.
These results prove that such a restriction can result in greater, rather
than less, power in the crucial statistical tests.


Comparison of difference 
and correlation approaches 


If a limited subgroup of subjects may be used to investigate the presence
of a linear relationship between X and Y, should the experimenter 
draw all of his subjects from the extreme portions of the X distribution 
and test the difference between Y means, or should he draw subjects at 
random from the entire range on X, estimate the linear correlation, 
and test it for significance? The answer to this question is clearly 
pertinent to the development of an adequate research strategy. As in 
the previous discussion, design efficiency will be evaluated by the power 
of the statistical tests that are involved. 


The test of significance of a product-moment correlation or, more
precisely, the test of the linear regression coefficient is

$$t = \frac{b}{s_b} \tag{9}$$


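Here the regression coefficient is the normally distributed variable, and
$\phi$ is formed from its expected value, $\beta$, and the population
standard deviation of the regression coefficient, $\sigma_b$.
After algebraic simplification, $\phi_c$ is found to equal

$$\phi_c = \frac{\beta}{\sqrt{2}\,\sigma_b} = \frac{\rho\,\sigma_Y/\sigma_X}{\sqrt{2}\,\sigma_Y\sqrt{1 - \rho^2}\,/\,(\sqrt{N}\,\sigma_X)} = \frac{\rho\sqrt{N}}{\sqrt{2}\,\sqrt{1 - \rho^2}} \tag{10}$$

With $2pN$ subjects this reduces to

$$\phi_c = \frac{\rho\sqrt{pN}}{\sqrt{1 - \rho^2}} \tag{11}$$

For the difference test based on $pN$ subjects per group, the value of
$\phi_d$ is

$$\phi_d = \frac{\rho z}{\sqrt{p\,(1 - c\rho^2)/N}} \tag{12}$$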


The relative power of the difference test and the correlation
test may be easily inferred from the ratio of $\phi_d$ to $\phi_c$. Where
the ratio exceeds 1.0, the difference test is more powerful; where
the ratio is less than 1.0, the correlation test is more powerful.
In general terms, the ratio equals

$$\frac{\phi_d}{\phi_c} = \frac{z\sqrt{1 - \rho^2}}{p\sqrt{1 - c\rho^2}} \tag{13}$$

It may be seen that the value of the ratio depends upon both $p$ and
$\rho$. The value of $\rho$ is, of course, unknown. The value of $p$, on the
other hand, is partially within the control of the experimenter and
partially dictated by the nature of the experiment. In some situations
the experimenter might be limited by practical considerations to the
use of no more than a small proportion of the subjects in the pool.
In other instances, it might be quite feasible to obtain $X$ and $Y$ measures
on all $N$ subjects.

To reveal the conditions under which each approach is the more
powerful, the ratio was evaluated for various conditions of subject
availability. These conditions, which indicate the proportion of subjects
available for the second stage of the experiment, range from 20 to
100 percent. (Since the difference technique achieves close to maximum
power when upper and lower quarters are used, the value of $\phi_d$ was
based on $p = .25$ for all availability percentages above 50.) For
each experimental condition the value of $\rho$ was determined which
would make the ratio greater than 1.0. These results are reported in
Table 2.

The data in this table reveal that as the availability percentage
increases, the choice of design strategy gradually shifts in favor of
the correlation approach. However, the advantage of the extreme
groups design holds until the availability percentage is quite large or
the population correlation is quite high. The two-stage design here
considered would generally be applied in instances where only a
moderate degree of relationship holds. Values of .50 or higher are
probably relatively rare in this preliminary stage of construct definition.
In view of this fact, a fairly clear-cut recommendation may be made.
If less than 75 percent of the subject pool can be used in the experiment,
the difference approach will almost surely be the more powerful.
If more than 80 percent of the subjects can be employed in the second
portion of the experiment, the correlation approach will almost
certainly be the more powerful.


Table 2
Comparative power of difference and correlation
approaches for testing the presence of relationship

Percent of Subjects
Available for Treatments    More Powerful Approach
20                          Difference when $\rho$ < .962
                            Correlation when $\rho$ > .962
30                          Difference when $\rho$ < .938
                            Correlation when $\rho$ > .938
40                          Difference when $\rho$ < .902
                            Correlation when $\rho$ > .902
50                          Difference when $\rho$ < .847
                            Correlation when $\rho$ > .847
60                          Difference when $\rho$ < .768
                            Correlation when $\rho$ > .768
70                          Difference when $\rho$ < .624
                            Correlation when $\rho$ > .624
75                          Difference when $\rho$ < .492
                            Correlation when $\rho$ > .492
80                          Difference when $\rho$ < .198
                            Correlation when $\rho$ > .198
More than …                 Correlation for all values of $\rho$
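The crossover correlations in Table 2 can be recomputed from equations (11) and (12). The sketch below rests on one reading of the procedure described above, stated here as an assumption: with a fraction $a$ of the pool available, the correlation test uses $aN$ subjects, while the difference test uses extreme groups of proportion $a/2$ each, capped at quarters. Setting the squared ratio of $\phi_d$ to $\phi_c$ equal to 1 and solving for $\rho$ gives the crossover value. The function name is hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal curve


def crossover_rho(availability):
    """Correlation at which the two tests have equal power, or None when
    the correlation test is more powerful for every rho.

    availability: fraction of the subject pool usable in the second stage.
    """
    p = min(availability / 2.0, 0.25)  # extreme groups never exceed quarters
    x = STD.inv_cdf(1.0 - p)
    z = STD.pdf(x)
    c = z * z / (p * p) - x * z / p
    # (phi_d / phi_c)^2 = a_ratio * (1 - rho^2) / (1 - c * rho^2)
    a_ratio = z * z / (p * availability / 2.0)
    if a_ratio <= 1.0:
        return None
    return ((a_ratio - 1.0) / (a_ratio - c)) ** 0.5


table_2 = {pct: crossover_rho(pct / 100) for pct in (20, 30, 40, 50, 60, 70, 75, 80)}
```

Under this reading the computed values agree with the tabulated crossover points to about three decimal places, and for availability well above 80 percent no crossover exists, which corresponds to the final row of the table.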


The above comparison does not take into account the added
degrees of freedom of the test of [...] greater than 50. [...]
precise for all practical purposes.


The difference in power of the two approaches may be illustrated
by an example. Assume $\rho = .30$ and that a total of 100 subjects was
tested on the classification variable. If the experimenter used
the upper and lower quarters, he would have a t test with 48 degrees of
freedom. With a 5 percent level of significance, the power of this test
would approximate .78. If, on the other hand, 50 subjects were
selected at random from the entire range of the classification variable and the
product-moment coefficient were tested for significance at the same
level, the power would equal .57. If 75 subjects were selected from the
full range on $X$, the power of the correlation test would equal .75.
If all 100 subjects could be used, the power would equal .87.

At first glance, it may appear paradoxical that throwing away data
from the middle half of the distribution improves the difference
approach and renders it superior to correlation analysis based on a
majority of subjects. However, these results are consistent
with the findings of other investigators [1, 3]. Those familiar with
item analysis procedures will no doubt recognize the analogous
results which hold in that field. Indeed, Kelley's proof [6] that
item validity is most efficiently assessed through the use of the
upper and lower twenty-seven percents follows a very similar line.
The elimination of the middle portion of the distribution does not merely
reduce the quantity of the information (a practice which rarely, if
ever, works to the benefit of a statistical procedure) but also changes the
quality of the data. When the full import of the modifications of both
quantity and quality are appreciated, the result no longer seems quite
anomalous.
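The powers quoted in this example can be roughly approximated from equations (6), (10), and (12). The sketch below is not the calculation used in the paper: it assumes a normal approximation to the t distribution (the degrees of freedom are ignored), so its values run slightly above those in the text. The function names are hypothetical, and the noncentrality value used is the expected difference divided by its standard error, that is, $\sqrt{2}\,\phi$.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal curve


def approx_power(delta, alpha=0.05):
    """Two-sided power under a normal approximation to the t test."""
    crit = STD.inv_cdf(1.0 - alpha / 2.0)
    return STD.cdf(delta - crit) + STD.cdf(-delta - crit)


def delta_difference(rho, n_total, p=0.25):
    """Expected mean difference over its standard error (sigma_Y = 1)."""
    x = STD.inv_cdf(1.0 - p)
    z = STD.pdf(x)
    c = z * z / (p * p) - x * z / p
    mean_diff = 2.0 * rho * z / p                        # equation (6)
    std_err = (2.0 * (1.0 - c * rho * rho) / (p * n_total)) ** 0.5
    return mean_diff / std_err


def delta_correlation(rho, n_used):
    """Corresponding value for the regression test on n_used subjects."""
    return rho * n_used ** 0.5 / (1.0 - rho * rho) ** 0.5


rho, pool = 0.30, 100
power_quarters = approx_power(delta_difference(rho, pool))
power_corr = {n: approx_power(delta_correlation(rho, n)) for n in (50, 75, 100)}
```

Even under this rough approximation the ordering reported in the text is preserved: the test on upper and lower quarters, which uses 50 subjects, is more powerful than the correlation test on 50 or 75 subjects, and less powerful than the correlation test on all 100.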

It should be emphasized that these results apply only to data which
conform to the bivariate normal distribution. Of particular importance
is the assumption of linear correlation. Where the hypothesis of a
curvilinear relationship can be seriously entertained, it would clearly
be unwise to sample in a fashion which did not permit close study of
the nature of the relationship. Thus preference should be given to
the difference approach only when the assumption of linearity is
strongly tenable.


Estimating the correlation
from extreme group statistics


The foregoing development has been concerned only with the
evaluation of the presence of a linear relationship, not with estimation of the
degree of that relationship. As McNemar [8] has succinctly pointed
out, such a methodology is almost certain to be abused, for it can
easily lead the experimenter to exaggerate the importance of trivial
results. If the distributions of $X$ and $Y$ are normal, or approximately
so, and if the relationship is linear, a useful approximation of the
product-moment coefficient may be obtained from statistics derived
from the extreme groups.

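From (6) the difference between means is seen to equal

$$\mu_{Y_U} - \mu_{Y_L} = \frac{2\rho z\sigma_Y}{p},$$

and the variance within either extreme group, as derived in [2],
equals

$$\sigma^2_{Y_U} = \sigma^2_Y\left[1 - \rho^2\left(\frac{z^2}{p^2} - \frac{xz}{p}\right)\right].$$

Thus

$$\sigma_Y = \frac{\sigma_{Y_U}}{\sqrt{1 - \rho^2[z^2/p^2 - xz/p]}}$$

and

$$\mu_{Y_U} - \mu_{Y_L} = \frac{2\rho z\sigma_{Y_U}}{p\sqrt{1 - \rho^2[z^2/p^2 - xz/p]}}.$$

Solving this equation for $\rho$ yields

$$\rho = \frac{\mu_{Y_U} - \mu_{Y_L}}{\sqrt{4z^2\sigma^2_{Y_U}/p^2 + (z^2/p^2 - xz/p)(\mu_{Y_U} - \mu_{Y_L})^2}} \tag{14}$$

If the upper and lower quarters are used to test for the presence
of a relationship, (14) becomes

$$\rho = \frac{\mu_{Y_U} - \mu_{Y_L}}{\sqrt{6.4630\,\sigma^2_{Y_U} + .7584\,(\mu_{Y_U} - \mu_{Y_L})^2}}$$

Using sample means to estimate population means and the mean
square within groups to estimate the variance within extreme
populations, the estimate of $\rho$ becomes

$$\hat{\rho} = \frac{\bar{Y}_U - \bar{Y}_L}{\sqrt{6.4630\,MS_{\text{within}} + .7584\,(\bar{Y}_U - \bar{Y}_L)^2}} \tag{15}$$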
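Equations (14) and (15) translate directly into a short routine. The following is a minimal sketch with a hypothetical function name; it assumes normality and linearity as stated above, and it takes the two group means and the within-groups mean square as given. The round-trip check at the end uses made-up population values ($\rho = .50$, $\sigma_Y = 10$) purely to confirm that the formula inverts equation (6) and the within-group variance correctly.

```python
from statistics import NormalDist


def estimate_rho(mean_upper, mean_lower, ms_within, p=0.25):
    """Estimate the product-moment correlation from extreme-group
    statistics, following equation (14); p is the proportion of the
    original sample in each extreme group."""
    std = NormalDist()
    x = std.inv_cdf(1.0 - p)          # deviate defining the upper group
    z = std.pdf(x)                    # ordinate of the normal curve at x
    d = mean_upper - mean_lower
    a = 4.0 * z * z / (p * p)         # about 6.463 when p = .25
    c = z * z / (p * p) - x * z / p   # about .7584 when p = .25
    return d / (a * ms_within + c * d * d) ** 0.5


# Round-trip check with illustrative population values (not data).
true_rho, sigma_y, p = 0.50, 10.0, 0.25
_std = NormalDist()
_x = _std.inv_cdf(1.0 - p)
_z = _std.pdf(_x)
_c = _z * _z / (p * p) - _x * _z / p
mean_gap = 2.0 * true_rho * _z * sigma_y / p              # equation (6)
var_within = sigma_y ** 2 * (1.0 - _c * true_rho ** 2)    # within-group variance
recovered = estimate_rho(mean_gap, 0.0, var_within, p)
```

With sample data the same call applies: pass the two observed means and the mean square within groups, and set `p` to the fraction of the original sample placed in each extreme group.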


This estimate represents a solution to a special case of the problem
of estimating $\rho$ from [...] the more general problem [...]
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dependence of empirical laws 
upon the source of 


experimental variation’ 


G. Robert Grice 


It is the intent of this paper to discuss two matters of experimental
design which have arisen in the course of the author's research. While
in some ways unrelated, both are illustrations of a common principle:
the nature of an observed [...] upon the nature of the particular
experimental design used to observe the relationship. More [...]
particular source of experimental variation which the design explores.
This point is not profound [...] that the laws relating [...] the same
experimental variables. Choice of the source of experimental [...]
but should be based upon [...]


Between-subjects and
within-subjects designs
It is quite common to hear an experimenter say, with obvious pride, 
that in this experiment, each subject served as his own control. The 
reasons for the pride are obvious. After all, what more comparable 
control group can there be than the same group? Furthermore, 
the method is efficient and economical, and one has done something 
elegant. However, such an experiment may or may not be a proper 
source of pride, and one very simple point should be raised: a subject 
who has served as his own control may not be the same subject that 
he would have been if he had not. If the experience in the control 
condition in any way influences performance in the experimental
condition, then a different result may be obtained than if a separate 
control group had been employed. Such an experiment may be good 
or bad, but if the experimenter thinks that he has merely done, more 
efficiently, the same investigation as the independent group experi- 
ment, he is mistaken. This problem has not gone unnoticed. Solomon 
(1949) and Campbell and Stanley (1963) have proposed designs 
concerned with evaluating the effects of pretesting in experiments 
dealing with such problems as transfer of training, attitude change, 
and teaching methods. The excellent discussion of Campbell and 
Stanley, in particular, should be read by investigators contemplating 
this type of design. The design in which the control group consists of 
the same subjects as the experimental group is a simple instance of the
more general class of "within-subject" designs which have become
widely prevalent in psychological research. These designs, of course,
involve the administration of a number of experimental treatments or a
number of values of some experimental variable to the same subject.
When certain treatments are administered to separate groups of
subjects while others are administered to the same subjects, these
are called "mixed" designs because they contain both within-subjects
and between-subjects effects. Most modern statistical textbooks
dealing with psychological research have full treatments of such designs.
There are two common reasons for obtaining a number of measures
from the same subject. In the first place, certain variables are inherently
within-subjects effects. For example, in studying learning as a function
of practice, it would be absurd to run a separate group of subjects for
each number of trials if a continuous performance measure is
available. In such experiments, we are specifically interested in the
effect which earlier treatments have on later performance. However,
the other, and perhaps more common, reason for such procedures is
purely statistical and has nothing to do with the scientific logic or
purpose of the investigation. The elimination of individual difference
variance from differences between treatment means and from the
associated error terms may result in a more efficient and less costly
experiment. It is quite natural that such designs should have great
appeal to experimenters.
In spite of the obvious advantages of within-subjects designs,
their use frequently may lead to incorrect interpretation of the
results.² The danger is that the experimenter may believe that the
design is simply more efficient, but otherwise equivalent to an
experiment in which the same treatments were administered to separate
groups of subjects. The fact is that the reasoning applied above to
the simple control-group experiment applies equally to all
within-subjects designs. In spite of the fact that the experimental conditions
or values of the experimental variables may be the same, the two kinds
of experiments are not the same, and actually investigate different
problems. While it is true that textbook authors may state that one
must assume that the administration of one treatment has no effect
upon the others, this assumption is often made rather easily and rests
upon no evidence. Frequently, the matter is merely ignored, and the
assumption of the equivalence of the designs is implicit rather than
explicit.

Recognition that these two types of designs are not equivalent, 
and are likely to yield different relationships among the 
variables, suggests the possibility of an additional and potentially 
important kind of experiment. This is one in which the two procedures 
are directly compared. Such an experiment is outlined in Table 1. 

Table 1 
Design for comparison of between-subjects and 
within-subjects experiments 

Condition of                      Treatment 
Administration      T1    ...    Tj    ...    Tk 

Between             nB1   ...    nBj   ...    nBk 
Within              nW1   ...    nWj   ...    nWk 

The upper row would constitute a between-subjects 
experiment, and would contain k independent groups of 
subjects, each receiving just one of the treatments. The lower row 
would be made up of subjects who received all of the k treatments. 
As in any two-dimensional design, the between-cells variance of the 
table could be analyzed into three components; however, only two of 
these would be of interest in the present instance. The row effect would 
indicate whether one of the two conditions resulted in overall superior 
performance. Generally, however, the chief interest would be in the 
interaction. If significant, it would indicate that the differences 
among the treatments depended upon the type of experiment. In the 
case of an ordered independent variable, it would indicate that the 
functions were of different form. In addition to simple significance 
testing, it might be desirable to apply curve-fitting or trend-analysis 
procedures. The column effect would ordinarily not be of interest in 
this design. 

The analysis of such an experimental design does pose certain 
problems. If the lower Within row is filled with the same subjects in 
all cells, ordinary analysis-of-variance procedures do not apply since 
the cells in this row are correlated and those based on independent 
groups in the Between condition are not. One solution to this problem, 
which we have used, is to run k groups of Within subjects. The data 
from only one treatment from each of these groups is then used. The 
Within row is then filled with data from independent groups, but is 
still, in effect, a “within” condition because all of the subjects have 
experienced all of the treatments. Statistically, however, the experiment 
may be analyzed in a straightforward manner as an independent 
groups design. This procedure does appear to be wasteful of data, 
because only 1/k of the data for each Within subject is used in the table. 
This is true, however, only for the significance testing, and all data 
may be used for plotting stable within-subjects functions. In the case 
of a quantified independent variable, there is one possibility which 
might be employed, using all of the data. A function might be fitted 
to the data for one condition and then tested for goodness of fit to the 
other as if it were a rational equation. This procedure is less adequate 
statistically, but if the data points were stable, it could produce a rather 
convincing comparison. This is similar to procedures used by Grice 
and Reynolds (1952) and Newman and Grice (1965) in other contexts, 
and by Kalish and Haber (1963) in the present context. 
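
The independent-groups analysis described above can be illustrated with a short computation. What follows is only a minimal sketch under stated assumptions, not the procedure of any of the studies cited: it assumes a balanced 2 x k table in which every cell holds scores from a separate group of equal size, and it partitions the between-cells variation into the Condition of Administration (row), treatment (column), and interaction components. The function name, the dictionary keys, and the scores are all hypothetical; the numbers were chosen only so that the treatment difference is five times as large in the Within row as in the Between row.

```python
from statistics import mean


def two_way_anova(cells):
    """Balanced two-way ANOVA for independent groups.

    cells[r][c] is the list of scores for the group in row r
    (condition of administration) and column c (treatment).
    Every cell is assumed to be a separate group of the same size.
    """
    rows, cols = len(cells), len(cells[0])
    n = len(cells[0][0])
    cell_means = [[mean(cells[r][c]) for c in range(cols)] for r in range(rows)]
    row_means = [mean(cell_means[r]) for r in range(rows)]
    col_means = [mean([cell_means[r][c] for r in range(rows)]) for c in range(cols)]
    grand = mean(row_means)

    # Sums of squares for rows, columns, and the row-by-column interaction.
    ss_rows = n * cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * rows * sum((m - grand) ** 2 for m in col_means)
    ss_inter = n * sum(
        (cell_means[r][c] - row_means[r] - col_means[c] + grand) ** 2
        for r in range(rows) for c in range(cols)
    )
    # Within-cell (error) sum of squares.
    ss_error = sum(
        (x - cell_means[r][c]) ** 2
        for r in range(rows) for c in range(cols) for x in cells[r][c]
    )
    df_error = rows * cols * (n - 1)
    ms_error = ss_error / df_error
    return {
        "F_rows": (ss_rows / (rows - 1)) / ms_error,
        "F_cols": (ss_cols / (cols - 1)) / ms_error,
        "F_interaction": (ss_inter / ((rows - 1) * (cols - 1))) / ms_error,
        "df_error": df_error,
    }


# Hypothetical scores: rows are Between and Within, columns are soft and loud.
# Each Within cell is taken from a different group of subjects, all of whom
# experienced both treatments, so the four cells are independent.
scores = [
    [[10, 12, 11, 11], [13, 15, 14, 14]],  # Between: means 11 and 14
    [[6, 8, 7, 7], [21, 23, 22, 22]],      # Within: means 7 and 22
]
result = two_way_anova(scores)
print(result)
```

Each F ratio would then be referred to the F distribution with the appropriate degrees of freedom. In this design the interaction term carries the question of interest, namely whether the treatment differences depend on the type of experiment.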

The author’s own interest in this problem first occurred in connec- 
tion with the effect of variations in stimulus intensity upon response 
evocation. The interest arose originally, not from methodological 
considerations, but from experimental data obtained at the University 
of Illinois laboratory. In a study of CS intensity in eyelid conditioning, 
Beck (1963) studied the intensity variable as a within-subjects effect. 
Surprisingly, she obtained an effect which was much larger than had 
ever before been obtained. Since all of the previous data came from 
between-subjects experiments, it was concluded that the effect must 
have been produced by the fact that each subject experienced the 
different stimulus intensities. As a result of this reasoning, an experi- 
ment similar to that described above was conducted by Grice and 
Hunter (1964). Two intensities of an auditory CS were used: a 1000- 
cycle tone at 50 or 100 decibels sound-pressure level. Two groups in 
the Between condition received only one CS each, either the loud or 
the soft tone. The Within subjects received both. In an experiment of 
this kind, the question of order of presentation will always arise for 
the Within condition. In this instance, the subjects were merely 
presented with the two stimuli in an irregular order throughout the 
100 conditioning trials. In some experiments, each treatment would 
have to be administered all at once, and the order could be randomized 
or counterbalanced unless some considerations dictated otherwise. 
Presumably, the solution would ordinarily be the same as if only the 
Within experiment were conducted. The result of the Grice and Hunter 
experiment was quite dramatic. The difference due to the intensity 
effect obtained under the two procedures was five times as great for 
the Within condition as it was for the Between condition. When 
analyzed by the independent groups method suggested above, the 
interaction term was statistically significant. 

A second experiment by Grice and Hunter (1964) dealt with signal 
intensity in simple reaction time. Reaction time has frequently been 
found to be related to stimulus intensity and has been discussed in 
relation to theoretical interpretations of intensity effects. The experi- 
ment was conducted in essentially the same manner as the previous 
one. Again, although less dramatically, the intensity effect was greater 
for the Within condition. The interaction effect was significant. This 
time, however, there was a general slowing of response in the two- 
stimulus situation so that the Condition of Administration effect was 
also significant. It appears probable that the uncertainty as to which 
stimulus would occur produced slower reactions, but a greater differ- 
ence between the loud and soft tones appeared, nevertheless. 

Recently, Behar and Adams (1966) obtained a similar result with 
foreperiod signal intensity in reaction time. Using three intensities 
of the ready signal, they found reaction time to be a significantly 
decreasing function of intensity for a within-subjects condition, 
but not for a between-subjects condition. However, their design did 
not provide for a test of the significance of the interaction. 

These findings concerning stimulus intensity effects have made 
necessary some readjustments in our theorizing concerning the opera- 
tion of this variable. For example, Hull's (1949) theory of the stimulus 
intensity dynamism, which simply assumed that dynamogenic effects 
were a function of the amount of stimulus energy, was shown to be 
inadequate. Grice and Hunter suggested that concepts such as 
adaptation level or contrast were necessary to describe the phenomena 
more fully. Additional experiments of the within-subjects type have 
been undertaken to further investigate these ideas. Another result of 
this second experiment was that it cast considerable doubt on the 
adequacy of the generalization theory of intensity effects proposed 
by Perkins (1953) and Logan (1954). This, too, has led to further 
investigation (Grice, Masters, and Kohfeld, 1966). The general 
outcome of the Grice and Hunter experiments was that the stimulus 
intensity variable is of considerably more interest as a within-subjects 
than as a between-subjects effect. This, in turn, is leading to more 
within-subjects experiments—not because of statistical considerations, 
but because of the greater interest in these relationships. It is suggested 
that this may frequently turn out to be the case for other variables. 

The area of stimulus generalization is one in which investigators 
have for some time been aware of the potential difference between 
these two classes of experimental designs, although the problem has 
not been phrased in just this way. The problem arises because it is 
necessary to test at several stimulus values in order to determine a 
generalization gradient. Some years ago, Grice (1951) said: 

Strictly speaking, only one valid measurement of generalization may be 
made for a particular subject. This is because of possible effects of one 
test trial on later ones. If test trials are reinforced, such reinforce- 
ment would have the effect of extending the generalization gradient. 
On the other hand, if the test trials are not reinforced, the picture is 
complicated by the effects of differential reinforcement [p. 151]. 

Differential reinforcement is known to steepen the generalization 
gradient (see, e.g., Raben). Both procedures have been used. Typical 
of the within-subjects approach is 
the well-known study of Hovland (1937) in which the subjects were 
tested at all tones with order counterbalanced. Counterbalancing of 
order, however, does not alter the fact that the subjects were exposed 
to all stimuli during testing. Typical of the between-subjects approach 
is the study by Grice and Saltz (1950) in which separate groups were 
tested to each stimulus. In spite of excellent reasons for thinking that 
the two kinds of experiments should yield different functions, the 
relation between the two procedures has never been thoroughly 
investigated. Wickens, Schroder, and Snide (1954) did do a partial 
replication of the Hovland (1937) pitch generalization experiment, 
using independent groups rather than the counterbalancing method. 
They obtained a convex generalization gradient as opposed to 
Hovland’s concave functions. It seems likely that this difference is 
dependent on the difference in experimental design, but one cannot 
be certain without direct comparison under identical laboratory 
conditions. 


The introduction by Guttman and Kalish (1956) of their operant 
conditioning techniques for studying generalization suggested, on 
logical grounds, that this within-subjects procedure might yield 
results nearly equivalent to a comparable between-subjects procedure. 
They explained as follows: 

The obtaining of generalization gradients for individual S's in this 
experiment is an outcome of the fact that aperiodic reinforcement 
greatly increases resistance to extinction, such that the introduction 
of a test stimulus during extinction reduces the response strength by a 
small fraction of its total extent . . . [p. 80]. 


One direct comparison has actually been made between this procedure 
and an experiment in which separate groups were tested at each 
stimulus. Hiss and Thomas (1963), studying wave-length generalization 
in the pigeon, used three separate groups which were trained at one 
stimulus and then individually tested at the CS and two generalized 
stimuli. The function obtained in this way was compared with that 
from a single group tested on all three stimuli as done by Guttman and 
Kalish. Gradients of four different response measures were compared. 
The method of comparison requires comment. Aware of the problems 
involved, they devised a method which they hoped would be an appro- 
priate solution. For the Between condition, triplets were selected at 
random with one value for each of the three stimuli. For each triplet, 
an index of steepness of slope was obtained by computing the percent- 
age of total responses made to the CS. In the case of latency measures, 
the percentages were based on time rather than number of responses. 
Percentage values were also obtained for the Within condition, appar- 
ently based on individual animals although this is not entirely clear. 
The authors regarded this test as “conservative” on the grounds that 
the triplets are more variable than a single animal. This ignores, 
however, the fact that the gradient points for the Between condition 
include individual difference variance, and any appropriate test must 
take this into account. The concept of “conservatism” is often of 
dubious validity when applied to a statistical test. Presumably, Hiss 
and Thomas meant conservative with respect to the commission 
of a Type I error. In this instance, however, a Type II error would be 
at least as serious, since the chief interest appeared to be in establishing 
the equivalence of the two procedures. In any case, the percentages 
were compared by means of the Mann-Whitney U test. There is at 
least one question concerning the applicability of this test to these 
data. Each random trial was based on data for three subjects, and data 
from each of these subjects was represented in two other trials. This 
means that a complex set of correlations would exist between the 
percentages, thus violating the assumption of independence underlying 
the U test. The degree to which this violation was important is difficult 
to evaluate. 
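
The steepness index described above is a simple percentage, and a short sketch may make it concrete. The function name and the response counts below are hypothetical and are not taken from Hiss and Thomas; the sketch assumes only what the text states, that the index is the percentage of the total responses (to the CS and the two other test stimuli) that was made to the CS.

```python
def steepness_index(responses):
    """Percentage of total responses that was made to the CS.

    responses is a triple: (responses to the CS, responses to the first
    generalized stimulus, responses to the second generalized stimulus).
    For the Between condition each value would come from a different
    subject; for the Within condition all three come from one subject.
    """
    total = sum(responses)
    return 100.0 * responses[0] / total


# Hypothetical counts: a steep gradient and a completely flat one.
steep = steepness_index((60, 30, 10))
flat = steepness_index((20, 20, 20))
print(steep, flat)
```

A flat gradient over three stimuli gives an index of one third of 100, and steeper gradients give larger values. Note that when the same subject's score enters several triplets, the resulting indices share data and are therefore not independent observations, which is the difficulty raised in the text.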

If the conclusions to be drawn from the Hiss and Thomas study 
are correct, they are of considerable interest. For latency of the first 
response, and for number of responses in the first 30-second test, the 
slopes were significantly steeper for the Between condition than for 
the Within condition. The difference was also in this direction for 
rate of responding on the first five test trials, but did not reach 
significance. The reason for interest is that the outcome is the direct 
opposite of what would ordinarily be expected, if extinction test trials 
were to steepen the gradient. This unexpected finding is not readily 
predictable from existing theory, and would not have been discovered 
without such an experiment. It clearly deserves further investigation. 
That this finding will turn out to be typical, however, is to be doubted. 
Kalish and Haber (1963) have reported a wave-length generalization 
study in which each subject was tested at only one value. The between- 
subjects gradient was flatter than the within-subjects function obtained 
in the original Guttman and Kalish (1956) experiment. 

The present discussion began with the point that within-subjects 
experimental designs in which two or more experimental treatments 
are applied to the same subjects are not equivalent to those in which 
each subject receives only one of the treatments. This is true even 
though the experimental conditions may be otherwise identical. It 
was then suggested that it may frequently be desirable or even necessary 
to conduct experiments in which the two procedures are directly 
compared. It turns out that such experiments may be of considerable 
substantive scientific interest, and may lead to new discoveries and 
to advancement in the understanding of the phenomena. While the 
examples given were limited to the areas of stimulus intensity and 
stimulus generalization, it is suggested that this is a matter of general 
importance and has wide implications for behavioral research. 


Between-conditions and 
within-conditions correlation 


The point to be made in this section has one principle in common with 
that made in the first section. The nature of an observed empirical 
relationship will depend upon the source of variance used by the 
investigator to look for it. Here, however, we begin with a particular 
use of the correlation coefficient. In psychological theory, the question 
arises as to whether two response measures may be regarded as indi- 
cants of a single theoretical variable. One obvious approach to this 
problem is to examine the correlation between the two measures. 
A high correlation would tend to support the view that they are deter- 
mined by the same underlying variable, while a low correlation would 
indicate that both may not be measures of the same process. For 
example, this question has arisen in connection with the Hullian 
concept of reaction potential, which Hull (1949) conceived as deter- 
mining several measures of response strength. From time to time the 
validity of this construct has been questioned on the grounds that 
correlations obtained between the response measures have not been 
satisfactorily high. The suggestion is made here that the particular 
kind of correlation usually used for this purpose does not provide a 
satisfactory basis for such an evaluation. 

When a correlation is to be computed for a sample consisting 
of a number of subgroups, the total correlation may be partitioned 
into two components: between-groups and within-groups. This is 
strictly analogous to the partition of variance in an ordinary analysis 
of variance. The subgroups may be selected on the basis of some 
criterion, or may be randomly selected groups receiving different 
experimental treatments. The within-groups correlation is an 
“average” correlation within the groups; the between-groups cor- 
relation is the correlation of group means for the two response 
measures. These two correlations are independent. To put it 
another way, the within-groups correlation is a measure of covaria- 
tion dependent on individual differences, while the between-groups 
correlation is a measure of the covariation produced by the experi- 
mental treatment. With the matter clarified in this way, one should 
now raise the question as to what kind of a relation it is proper to 
consider for any particular purpose. It seems clear that a major 
determining factor should be the nature of the theoretical concepts 
under study. If the concepts are conceived as relatively stable traits of 
individuals, then it would appear that the within-groups correlation 
would be most appropriate since it maximally reflects individual 
difference variance under constant experimental conditions. If, on 
the other hand, the concepts are designed to predict the effects of 
experimental variables upon behavior, the between-groups correlation 
should be more appropriate, because it indicates the variation resulting 
from manipulation of these variables. The correlations between 
various response measures in learning which have typically been 
reported (Kimble, 1961) have been based on a single experimental 
condition following a given amount of training. That is, they have 
been within-condition correlations and no covariation attributable to 
manipulation of experimental variables has been included. 

3. A readily available reference to the logic and computations involved in the 
partitioning of covariance is to be found in Lindquist (1953, Chapter 14). 
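
The partition of a total correlation into between-groups and within-groups components can be sketched numerically. The following is an illustrative sketch only: the function name and the data are hypothetical, and it follows the usual partition of sums of squares and cross-products, with the between-groups component weighted by group size (with equal group sizes this equals the plain correlation of the group means).

```python
from math import sqrt
from statistics import mean


def partition_correlation(groups):
    """Return (r_total, r_between, r_within) for grouped (x, y) pairs.

    groups is a list with one entry per experimental group; each entry
    is a list of (x, y) pairs, one pair per subject.
    """
    pairs = [p for g in groups for p in g]
    grand_x = mean([x for x, _ in pairs])
    grand_y = mean([y for _, y in pairs])

    sxx_b = syy_b = sxy_b = 0.0  # between-groups sums of squares and products
    sxx_w = syy_w = sxy_w = 0.0  # within-groups sums of squares and products
    for g in groups:
        mx = mean([x for x, _ in g])
        my = mean([y for _, y in g])
        sxx_b += len(g) * (mx - grand_x) ** 2
        syy_b += len(g) * (my - grand_y) ** 2
        sxy_b += len(g) * (mx - grand_x) * (my - grand_y)
        for x, y in g:
            sxx_w += (x - mx) ** 2
            syy_w += (y - my) ** 2
            sxy_w += (x - mx) * (y - my)

    r_between = sxy_b / sqrt(sxx_b * syy_b)
    r_within = sxy_w / sqrt(sxx_w * syy_w)
    r_total = (sxy_b + sxy_w) / sqrt((sxx_b + sxx_w) * (syy_b + syy_w))
    return r_total, r_between, r_within


# Hypothetical data: three treatment groups whose means lie on a line,
# with individual differences that are unrelated within each group.
offsets = [(-1, 0.5), (1, 0.5), (-1, -0.5), (1, -0.5)]
group_means = [(10, 2), (20, 4), (30, 6)]
data = [[(mx + dx, my + dy) for dx, dy in offsets] for mx, my in group_means]
r_total, r_between, r_within = partition_correlation(data)
print(r_total, r_between, r_within)
```

In this constructed case the within-groups correlation is zero and the between-groups correlation is 1, which shows how two measures can covary perfectly over treatments while showing no relation among individuals treated alike.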


The first demonstration of the potential significance of this type 
of reasoning came from an analysis made by Grice (1956) of stimulus 
generalization data collected by Grice and Saltz (1950). The experiment 
was a study of size generalization in the rat and was composed of nine 
experimental groups. The group means yielded orderly gradients 
indicating varying amounts of generalization decrement. The measure 
reported by Grice and Saltz was the number of responses in extinction 
to the test stimuli. Speed of the first test trial, not included in the 
original report, was reported in the analysis by Grice. The total 
correlation between these two measures was then analyzed into two 
components. While the within-groups correlation was only .10, the 
between-groups correlation was .99. A second analysis of this kind 
has been reported by Newman and Grice (1965). This was also a size 
generalization experiment including the additional variable of drive 
level. The experiment was designed to test theoretical predictions 
concerning the effect of drive on generalization gradients. There were 
four generalization test stimuli which were tested under 12 and 48 hours 
of food deprivation. Thus there were eight independent groups with 
the same two response measures used by Grice and Saltz. In this 
instance, the within-conditions correlation was .22, but the between- 
conditions correlation was .99. In spite of the fact that this correlation 
clearly indicates the high degree of linearity, it is still instructive to 
examine the scatter plot of the eight pairs of group means presented in 
Figure 1. The filled circles indicate the groups tested under 48 hours 
of deprivation, and the hollow circles are for the 12-hour groups. 
The difference between the points for these two conditions indicates 
the effect of drive, and the difference among the points for each drive 
level indicates varying amounts of stimulus generalization. There might 
be a number of possible explanations for this relationship. However, an 
obvious interpretation would be that the two 
response measures are linearly related to a single theoretical state, 
which, in turn, is influenced by the independent, experimental variables. 
Of course, this is the way the reaction potential construct is supposed to 
behave. There are several things which may be said about the difference 
in the size of the correlations obtained from these two sources. In the 
first place, training and testing a group of subjects under identical 
conditions would tend to make the group homogeneous or reduce 
systematic individual differences in reaction potential. It also seems 
likely that there may be large individual differences in the response 
measures themselves which are unrelated to reaction potential. For 
example, some subjects probably are inherently faster responders 
than others. Moment to moment oscillation in reaction potential will 
also serve to reduce the within-subjects relationship. In the treatment 
means, on the other hand, the contribution of individual differences 
variance is reduced by a factor of 1/n. The main point about the 
between-treatments relationship, however, is that the experimental 
treatments introduce systematic variation in reaction potential. In 
analysis-of-variance terms, this is a fixed effect rather than a random 
effect. Since the concern of the theory is with the effect of the experi- 
mental variable on reaction potential, it seems clear that the between- 
treatments effect is the one to examine if one wishes to ascertain 
whether both response measures are indicants of the theoretical state. 
Basically, the question reduces to whether or not the measures yield 
similar S-R laws when experimental variables are manipulated. 

Fig. 1 
Between-conditions relation of extinction and speed. (Data from 
Newman and Grice, 1965.) 

4. In a situation of this kind, one would ordinarily not be interested in 
predicting one measure from the other, but in a line which simply 
describes the relationship. For this reason, the lines fitted to these 
data and those in subsequent examples are mutual regression lines. This is 
the line which minimizes the sum of the sums of the squared residuals for 
the two variables when both variables are scaled in standard deviation units. 
The equation for such a line is: 

Y' = M_y + (s_y / s_x)(X - M_x) 

where M and s are the means and standard deviations of the two 
measures, and the slope takes the sign of the correlation. 

Fig. 2 
Between-conditions relation of extinction and initial speed. (Data 
from Perrin, 1942.) Horizontal axis: Initial Speed. Legend entries: 
16 trials; 3 hours deprivation. 
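
The mutual regression line mentioned in the footnote on the fitted lines can be sketched as follows. This is a minimal sketch resting on one stated assumption: that the line intended is the one defined there, namely the line minimizing the summed squared residuals of both variables in standard-score units. Under that definition the minimizing slope in standard scores is +1 or -1 (the sign of r), so in raw units the slope is the ratio of the two standard deviations. The function name and the data are hypothetical.

```python
from math import copysign, sqrt
from statistics import mean


def mutual_regression_line(xs, ys):
    """Return (slope, intercept) of the mutual regression line.

    The slope is s_y / s_x with the sign of the correlation, and the
    line passes through the point of means. If the covariation is
    exactly zero the positive sign is used by default.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = copysign(sqrt(syy / sxx), sxy)
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical group means on two response measures.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
slope, intercept = mutual_regression_line(xs, ys)
print(slope, intercept)
```

The slope obtained this way is the geometric mean of the two ordinary least-squares slopes (Y on X, and X on Y expressed in the same coordinates), so it always lies between them and favors neither measure as the predictor.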

A similar example of the relationship between resistance to 
extinction and response speed is a set of data reported by Perrin (1942). 
The experiment was an attempt to determine reaction potential as a 
joint function of amount of training and degree of hunger. Five levels 
of initial training and four levels of food deprivation were employed 
in a discrete trial bar-pressing situation. The measure usually reported 
for this study is resistance to extinction, but initial latency measures 
were also reported. The latency measures have been converted to 
speed, and the between-conditions relation between speed and number 
of extinction trials is presented in Figure 2. Variation attributable to 
the deprivation variable is indicated by the hollow circles, and that 
attributable to the number of training trials is indicated by the filled 
circles. It may be seen that these points are fairly well-indicated by a 
linear function. The between-conditions linear correlation is .94. 
Data for computation of a within-conditions correlation were not 
presented, but it is probably safe to assume that it was low. These 
data, plus those of the previous studies, suggest that there exists a 
domain in which resistance to extinction and response speed do 
measure the same thing, and that this underlying state may be manipu- 
lated by amount of training, level of deprivation, and stimulus similarity. 

In the above examples in which a total correlation was partitioned 
into within- and between-conditions correlations, the two response 
measures were obtained from the same subjects. It should be pointed 
out, however, that a between-conditions correlation is still meaningful 
even though the two response measures were obtained from different 
subjects. There are various reasons why this might be desirable or 
necessary. In the first place, the nature of the measures might be 
such that it is impossible or inconvenient to obtain both. Another 
possibility is that obtaining one measure might invalidate a second 
to be taken later. A third situation is one in which there is another 
experimental variable in addition to the one over which the correlation 
is to be obtained. It may be that the same response measure is not 
available in all states of this additional variable. An example of this is 
to be found in an experiment by Grice (1949). This experiment was a 
comparison of visual discrimination learning in the rat with simultane- 
ous and Successive presentation of stimuli. In the situation in which the 
two stimuli are presented simultaneously, learning is measured by the 
percentages of choices of the positive stimulus. In the situation in 
which only one stimulus is presented on a trial, learning is measured by 
the increasing difference in latency of response to the positive and 
negative stimuli. Theoretical considerations provide the rationale by 
means of which these two measures may be related. The percentage- 
correct measure may be regarded as a function of the difference in 
reaction potential between the positive and negative stimuli, or as a 
measure of overlap between the two reaction potential distributions. 
This measure was presented for successive blocks of 10 trials. A similar 
measure of overlap in reaction potential can be obtained from the 
latency measures from the successive condition. For each block of 10 
trials, the number of times that latency to the positive stimulus was 
faster than a response to the negative stimulus was obtained. This 
turns out to be the familiar U statistic. A percentage value may be 
obtained by determining the percent this value is of the total number of 
opportunities for response to the positive stimulus to be faster 


G. R. Grice 157 


(U/U max × 100).⁵ The choice measure is plotted as a function of the 
measure derived from latency in Figure 3. The correlation over the 
10-trial blocks was .97. Under the assumption, on theoretical grounds, 
that the two measures are approximately comparable, the linear 
function with a slope of about 45 degrees indicates that the level of 
learning in these two situations was about the same at all levels of 
practice. A slope other than 45 degrees or departure from linearity 
would indicate differences in rate of learning or differences in form of 
learning functions. 

Fig. 3 
Between-conditions relation between percent correct choices for paired 
presentation and percent faster responses to positive stimulus for single 
presentation. (Points are for blocks of 10 trials during learning. Data 
from Grice, 1949.) Vertical axis: Paired: % Correct. Horizontal axis: 
Successive: % Faster to Positive. 

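
The overlap measure derived from the latencies (U as a percentage of its maximum) can be sketched in a few lines. This is a minimal sketch with hypothetical names and data. It assumes that every latency to the positive stimulus in a block is compared with every latency to the negative stimulus in that block, and it counts ties as one half, which is a common convention for U but is not stated in the text.

```python
def percent_faster_to_positive(pos_latencies, neg_latencies):
    """U / U_max x 100 for one block of trials.

    U counts the (positive, negative) pairs in which the latency to the
    positive stimulus is shorter; a tie counts one half (an assumption).
    U_max is the total number of such pairs.
    """
    u = 0.0
    for p in pos_latencies:
        for q in neg_latencies:
            if p < q:
                u += 1.0
            elif p == q:
                u += 0.5
    u_max = len(pos_latencies) * len(neg_latencies)
    return 100.0 * u / u_max


# Hypothetical latencies (in seconds) from one block of trials.
positive = [1.2, 0.9, 1.5]
negative = [1.0, 1.4, 2.0]
overlap = percent_faster_to_positive(positive, negative)
print(overlap)
```

A value of 50 corresponds to complete overlap of the two latency distributions, and 100 to no overlap, which is what allows the measure to be set beside the percentage of correct choices from the simultaneous condition.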

5. This is not the same measure reported in the original paper, but it is closely 
related to it and is believed to be somewhat superior. 

Miller (1959) has suggested a type of analysis which uses what is 
essentially the present type of reasoning. He was concerned with the 
legitimacy of the introduction of intervening variables and suggested 
that the only efficient use of such theoretical entities is when multiple 
experimental variables and multiple response measures are employed. 
Subsequently, Miller (1961) presented a set of data illustrating the 
reasoning. Three experimental variables which might be presumed to 
influence thirst in the rat and three response measures which might be 
presumed to measure it were employed. The first independent variable 
consisted of four amounts of predrinking, varying from 0 to 15 
milliliters. The second variable consisted of 15 milliliters of water 
injected directly into the stomach by means of a fistula. The third was 
the filling of a balloon in the stomach with 15 milliliters of water. The 
three response measures were the subsequent amount of water intake, 
the concentration of quinine in the water required to stop drinking, and 
bar pressing on a VI schedule rewarded by water. Miller presented the 
data in a series of bar graphs, but they lend themselves especially well to 
the kind of analysis used here. A plot of the between-conditions relation 
between water intake and the quinine measure is presented in Figure 
4A. The between-conditions correlation is .97, and the strong linear 
relation suggests that the two measures reflect a single underlying state, 
presumably thirst. One would also conclude that the balloon had 
little, if any, effect on thirst; and that the fistula load did reduce thirst, 
but less than an equal amount of water taken by mouth. In Figures 4B 
and C, bar pressing is plotted as a function of water intake and the 
quinine score. Here the picture is quite different. It appears that bar 
pressing is a good measure of thirst only when it is manipulated by pre- 
drinking. The between-conditions correlation for this variable alone 
is .98 with water intake, and .99 for quinine. However, the inclusion 
of the other two conditions reduces both correlations to .52. It is 
clear that the stomach balloon and, to a lesser extent, the fistula load, 
reduce bar pressing to a level below what would be predicted on the 
basis of level of thirst. Miller rightly pointed out that had bar pressing 
alone been used, there could have been the erroneous conclusion that 
the balloon reduced thirst. He suggested that an additional variable, 
such as pain, is operating with the bar pressing measure. One additional 
point that this analysis makes clear is that Miller’s exercise would have 
been more elegant had he included additional values of the fistula and 
balloon variable. If these could be added to the two graphs of Figures 
4B and C, they would provide additional functions leading to a fuller 
understanding of the relationships. 

More recently, Stricker and Miller (1965) added an additional 
measure of thirst consisting of licking an empty drinking tube only 
rarely containing water. This measure has the advantage of not 
satiating the animal. Over six values of the predrinking variable, 
the between-conditions correlation with the intake measure was .998. 
The within-conditions correlation was .31. It was possible to compute 
both of these coefficients from data presented in the paper. 

Fig. 4 
Between-conditions relations between presumed measures of thirst. 
(Data from Miller, 1961.) Panels: (A) Quinine score and water intake; 
(B) Bar pressing and water intake; (C) Bar pressing and quinine score. 
Legend entries: 0 ml water; 5 ml water; 10 ml water; 15 ml water; 
Fistula; Balloon. 

In the examples presented here, the product-moment correlation 
has been used as an index of the degree of the between-conditions 


160 Research Problems in Psychology 


relationship. These correlations frequently turn out to be substantially 
higher than those usually encountered in individual difference work. 
This should not come as a great surprise, however, since the covariation 
in these examples contains an effect due to systematic, and presumably 
strategic, manipulation of experimental variables, rather than being 
entirely dependent on sampling. An additional implication of this 
is that the significance of these correlations may not be appropriately 
tested by the usual method. However, analysis-of-variance methods 
may be adapted to obtain tests of the significance of linear regression 
and of departure from linearity. In the case of small, but systematic, 
departures from linearity, it appears that one might frequently be 
more interested in the “eyeball” test of goodness of fit than in statistical 
significance. In the case of monotonic but curvilinear relations, one 
might be interested in the use of transforming functions which could 
become statements within a theory. Ideally, it would be most satis- 
factory if such functions could be rationally derived from theoretical 
considerations. Another point which should be made about these 
correlations is that their value is specific to a particular experiment, 
since their size will depend upon the range of the experimental variable. 
Further, it should be noted that between-conditions correlations are not 
independent of sample size for the treatments, because of the dependence 
of the variability of the mean on sample size. Finally, it is emphasized 
that the main point is not the use of the correlation coefficient. The 
point is that it is important to determine whether two or more response 
measures yield similar laws [...] never be obtained [...] from studies 
of individual differences under constant experimental conditions. In more 
general terms, the problem may be 
stated as comparing the nature of the laws into which different response 
measures enter. For, in the long run, the conditions under which 
measures are not related in a simple fashion are of as much significance 
as those in which they are. Analysis of this kind appears to provide a 
useful approach to theory development. 
a note on functional 
relations obtained from 
group data 


Murray Sidman 


The empirical determination of functional relations between behavior 
and its controlling variables forms a large part of modern behavioral 
research. One important aspect of this type of experimentation is 
the method of distributing subjects among the various points which 
determine an empirical curve. 

The most direct method is to use a single organism, and the same 
organism, to obtain every point on the curve. This procedure is not 
always practicable, however, for one or both of two reasons. 


1. Intra-organism variability may be so great as to obscure any 
lawful relation. It is sometimes possible to avoid this problem by 
taking several determinations at each point and using a statistical 
measure, a common technique in obtaining threshold measurements [2]. 

2. Even this procedure will not be effective if, as is often the case, the 
experimental operations involved in determining one point on the curve 
have an effect upon the values of other points. For example, one cannot 
use the same organism to determine all the points on a function 
relating extinction responding to number of reinforcements. One 
reason for this is that the extinction operation is itself a variable 
entering into extinction results subsequently to reconditioning [4]. 
It is seldom, if ever, possible to get around this difficulty by using a 
different individual for each point on the curve. Here inter-organism 
variability comes into the picture to obscure lawfulness. 

From Psychological Bulletin, Vol. 49, 1952, pp. 263-269. Copyright 1952 by 
the American Psychological Association. Reproduced by permission. 


Faced with these problems, most experimenters turn to group data. 
One technique is to employ the same group of organisms to obtain all 
the points. This procedure, however, is also ruled out if the second 
situation mentioned above is in effect (unless, of course, this is the 
problem under investigation). The only recourse remaining is to use a 
different group to determine each point. The rest of this paper will be 
devoted to a discussion of certain considerations involved in this 
latter method of obtaining an empirical function. 


Individual vs. 
averaged functions 


The first point to be made is that the mean curve obtained by such a 
procedure is not necessarily of the same shape as the inferred individual 
curves. (The term “inferred” is used here since this method is generally 
employed when the individual curves cannot be obtained directly.) 
The following development brings this out clearly. For the purpose 
of demonstration we take as our example the negatively accelerated 
positive growth function which has achieved a certain prominence in 
behavior theory. This function can be expressed as 


$$ y = M - M e^{-kx}, \tag{1} $$


where M is the asymptote approached by y, and k determines the rate 
of approach to M. If the curves for individual organisms are of this 
shape, inter-organism variability might occur in the asymptotes 
approached by the curves, in the rates of approach to the asymptotes, 
or both. Figure 1 represents a set of individual curves which vary with 
respect to both constants. (Although, for the sake of simplicity in 
Figure 1, M and k are assumed to be positively correlated, this assumption 
is not necessary.) When, for the reasons mentioned above, it is 
not possible to obtain these individual curves empirically, the procedure 
generally followed is to expose a different sample of the population of 
subjects to each value of the independent variable, x, and to take the 
mean of the dependent variable as the corresponding value of y. On 
the assumption that each of the samples is equally representative of 
the population, this procedure is represented in Figure 1 by the broken 
lines drawn from selected values of x. These lines simply indicate that, 
in a given experiment, the distribution of functions is cut through at 
selected values of the independent variable. 

[Fig. 1. Sample set of individual curves of the form y = M - Me^{-kx}. Each curve differs with respect to both M and k.]

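Corresponding to the curves of Figure 1, we can write the following 
equations: 

$$
\begin{aligned}
y_1 &= M_1 - M_1 e^{-k_1 x} \\
y_2 &= M_2 - M_2 e^{-k_2 x} \\
&\;\;\vdots \\
y_n &= M_n - M_n e^{-k_n x}.
\end{aligned}
\tag{2}
$$

To determine the mean value of y for a given value of x we sum equations 
(2) and divide by n, which results in the expression, 

$$ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n}\left(M_i - M_i e^{-k_i x}\right), \tag{3} $$
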
which can be written 

$$ \bar{y} = \frac{1}{n}\left[\sum_{i=1}^{n} M_i - \left(M_1 e^{-k_1 x} + M_2 e^{-k_2 x} + \cdots + M_n e^{-k_n x}\right)\right]. \tag{4} $$

Each of the exponentials in 

$$ S = M_1 e^{-k_1 x} + M_2 e^{-k_2 x} + \cdots + M_n e^{-k_n x} \tag{5} $$

can be expanded to give 

$$
\begin{aligned}
S = {} & \left[M_1 + M_1(-k_1)x + M_1(-k_1)^2 x^2/2! + M_1(-k_1)^3 x^3/3! + \cdots\right] \\
& + \left[M_2 + M_2(-k_2)x + M_2(-k_2)^2 x^2/2! + M_2(-k_2)^3 x^3/3! + \cdots\right] \\
& + \cdots \\
& + \left[M_n + M_n(-k_n)x + M_n(-k_n)^2 x^2/2! + M_n(-k_n)^3 x^3/3! + \cdots\right].
\end{aligned}
\tag{6}
$$

Upon rearranging coefficients we have 

$$
\begin{aligned}
S = {} & \sum_{i=1}^{n} M_i + \left[M_1(-k_1) + M_2(-k_2) + \cdots + M_n(-k_n)\right]x \\
& + \left[M_1(-k_1)^2 + M_2(-k_2)^2 + \cdots + M_n(-k_n)^2\right]x^2/2! + \cdots \\
& + \left[M_1(-k_1)^m + M_2(-k_2)^m + \cdots + M_n(-k_n)^m\right]x^m/m! + \cdots.
\end{aligned}
\tag{7}
$$

This can be expressed 

$$ S = \sum_{i=1}^{n} M_i + A_1 x + A_2 x^2/2! + \cdots + A_m x^m/m! + \cdots, \tag{8} $$

where 

$$ A_m = M_1(-k_1)^m + M_2(-k_2)^m + \cdots + M_n(-k_n)^m. \tag{9} $$

Substituting equation (8) into equation (4) we arrive at 

$$ \bar{y} = \left[-\left(A_1 x + A_2 x^2/2! + \cdots + A_m x^m/m! + \cdots\right)\right]/n. \tag{10} $$

Equation (10) will reduce to the form (1) if and only if 

$$ A_i = A_j \tag{11} $$

for all i, j, or if 

$$ A_i = A_1^{\,i}. \tag{12} $$

Condition (11) is impossible, since the A's are alternately negative 
and positive. Condition (12) is easily demonstrated to be impossible 
unless the A's all equal unity, in which event (12) becomes a special 
case of (11). 
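
As a numerical check on this expansion, the following Python sketch compares the mean curve computed directly from a set of individual curves with the truncated power series in the coefficients A_m. The (M, k) pairs are hypothetical, the function names are illustrative, and the series is written as the mean curve equal to minus the sum of A_m x^m/m!, divided by n, which is what substituting (8) into (4) gives.

```python
import math

# Hypothetical (M_i, k_i) pairs for n = 3 organisms.
params = [(10.0, 0.2), (20.0, 0.5), (30.0, 1.0)]

def A(m, params):
    """Coefficient A_m: sum over organisms of M_i * (-k_i)**m, as in (9)."""
    return sum(M * (-k) ** m for M, k in params)

def mean_direct(x, params):
    """Mean of the individual curves y_i = M_i - M_i * exp(-k_i * x)."""
    return sum(M - M * math.exp(-k * x) for M, k in params) / len(params)

def mean_series(x, params, terms=40):
    """Truncated series -(A_1 x + A_2 x^2/2! + ...)/n for the mean curve."""
    total = sum(A(m, params) * x ** m / math.factorial(m)
                for m in range(1, terms + 1))
    return -total / len(params)

x = 1.5
print(mean_direct(x, params), mean_series(x, params))
# With positive M_i and k_i the coefficients alternate in sign.
print([A(m, params) for m in (1, 2, 3, 4)])
```

The two printed values of the mean agree to within rounding error, and the listed coefficients alternate between negative and positive, as the text states.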

It has been shown, then, that for individual curves of the form (1), 
if inter-organism variability occurs both in the asymptotes and in the 
rates of approach to these asymptotes, the average curve cannot be 
described by an equation of the form (1).¹ It can be seen from equations 
(7) to (10) that this will also be the case if the asymptotes are equal 
and variability occurs only in the rates of approach. Only when the 
rates of approach are equal will the mean curve be of the form (1). 
Thus, under the assumption that 

$$ k_i = k_j \tag{13} $$

for all i, j, equation (4) can be rewritten 

$$ \bar{y} = \frac{1}{n}\left(\sum_{i=1}^{n} M_i - e^{-kx}\sum_{i=1}^{n} M_i\right). \tag{14} $$

As far as this writer is aware, the assumption that the rates of approach 
to the asymptotes are equal for all the organisms in a given experiment 
has never been explicitly acknowledged by any experimenter or theorist 
who has fitted this growth function to data obtained by the method 
under discussion.² 
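
The contrast between equal and unequal rates of approach can be seen numerically. The Python sketch below is only an illustration under stated assumptions: the parameter values are hypothetical, and the test of form is one convenient choice, not a procedure given in the paper. If a curve has the form (1) with asymptote M, then -ln(1 - y/M)/x equals k at every x. The sketch applies this to the mean curve, using the mean of the individual asymptotes as M, and reports the rate implied at several values of x.

```python
import math

def growth(x, M, k):
    """Individual curve of form (1): y = M - M*exp(-k*x)."""
    return M - M * math.exp(-k * x)

def mean_curve(x, params):
    """Mean of the individual curves at a given x."""
    return sum(growth(x, M, k) for M, k in params) / len(params)

def implied_rates(params, xs):
    """Rate that form (1) would need at each x to reproduce the mean curve,
    taking the mean of the M_i as the asymptote."""
    M_bar = sum(M for M, _ in params) / len(params)
    return [-math.log(1.0 - mean_curve(x, params) / M_bar) / x for x in xs]

xs = [1.0, 2.0, 4.0, 8.0]
varying_k = [(10.0, 0.2), (20.0, 0.5), (30.0, 1.0)]  # hypothetical (M, k)
equal_k = [(10.0, 0.5), (20.0, 0.5), (30.0, 0.5)]    # hypothetical (M, k)

rates_varying = implied_rates(varying_k, xs)
rates_equal = implied_rates(equal_k, xs)
print(rates_varying)
print(rates_equal)
```

When the rates are equal the implied rate is the same at every x, so the mean curve has the form (1). When the rates differ the implied rate drifts with x, so no single pair of constants in (1) reproduces the mean curve exactly.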

At this point it may be argued that although the mean curve is 
not the same as the individual curves, it is similar enough that, within 
the limits of experimental error, it can be fitted satisfactorily by the 
same function. Although this argument possesses dubious merit on 
grounds of theoretical consistency, it can also be attacked by demon- 
strating that many other types of individual curve will, if averaged, 
give as good an approximation to (1) as will equation (10). 

For example, if the individual curves are straight lines of the form 

$$ y = mx \tag{15} $$

up to a given value of x, at which point there occurs a discontinuity 
(see Figure 2) after which 

$$ y = c, \tag{16} $$

it can be shown that the mean curve can be described by 

$$ Y = f(x)(1 - ax) + aBx, \tag{17} $$

where B is the sum of the maximum values of y and a is a proportionality 
constant between the slope, m, and the maximum value, y_c, of y. 
(This assumption of proportionality is not necessary, but is made 
merely to simplify the discussion.) f(x) is a function describing the 
relationship between x and the sum of the y_c which have been reached 
at any value of x. (It can be seen from Figure 2 that, as x increases, 
more of the individual curves will reach their maximum values of y.) 
f(x) will be determined by the frequency distribution of y_c and the 
relation, if any, between y_c and x. The form of f(x) will determine 
how closely equation (17) approximates equation (1). We see, then, that 
if the individual curves are of the form indicated in Figure 2, the mean 
curve may approximate equation (1) or any one of a large number of 
other forms. 

[Fig. 2. Sample set of individual curves of the form y = mx up to a point of discontinuity, after which y = c.]

1. The author is indebted to Mr. L. A. Gardner, Jr. for the essential elements of 
this demonstration. 

2. Hull appears actually to have made the opposite assumption. He states, 
“The ‘constant’ numerical values appearing in equations representing 
primary molar behavioral laws vary . . . from individual to individual . . . ” 
[3, Postulate 18]. 
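
Equation (17) can be checked against a direct summation of individual curves. The Python sketch below assumes, as a reading of the text, that Y is the sum over individuals (the mean is then Y divided by the number of individuals), that each curve follows y = mx with m = a times its maximum until its own break point, and that it equals its maximum from the break point on. The maxima, break points, and the value of a are hypothetical.

```python
def individual(x, c, x_break, a):
    """y = m*x with slope m = a*c before the break point, y = c afterwards."""
    return a * c * x if x < x_break else c

def f(x, subjects):
    """Sum of the maxima c that have already been reached at this x."""
    return sum(c for c, x_break in subjects if x >= x_break)

def summed_direct(x, subjects, a):
    """Sum of the individual curves at x."""
    return sum(individual(x, c, x_break, a) for c, x_break in subjects)

def summed_eq17(x, subjects, a):
    """Right-hand side of equation (17): f(x)(1 - ax) + aBx."""
    B = sum(c for c, _ in subjects)
    return f(x, subjects) * (1 - a * x) + a * B * x

# Hypothetical (maximum c, break point) pairs and proportionality constant a.
subjects = [(4.0, 1.0), (6.0, 2.0), (10.0, 3.0)]
a = 0.1
for x in (0.5, 1.5, 2.5, 3.5):
    print(x, summed_direct(x, subjects, a), summed_eq17(x, subjects, a))
```

The two columns agree at every x. How closely the resulting curve resembles the growth function depends entirely on how the break points are distributed, which is the role the text assigns to f(x).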

Although this discussion has treated only two specific cases in any 
detail, the same type of analysis can be carried out for any functional 
relation. In some cases it will be found that the mean curve will be of 
the same form as the individual curves, e.g., the straight line. However, 
many functions will show the property discussed above, namely, that 
the mean curve cannot be of the same form as the individual curves 
except under special conditions. Furthermore, given a particular 
mean curve, the form of the individual curves is not uniquely specified. 
It appears, then, that when different groups of subjects are used to 
obtain the points determining a functional relation, the mean curve 
does not provide the information necessary to make statements 
concerning the function for the individual. 


Alternative procedures 


Given a situation like that outlined above, there are several alternatives 
open to experimenters and theorists. First the suggestion might be 
made that all data obtained by the averaging procedure outlined 
above be ignored and that the questions which such data attempt to 
answer not be asked. This radical solution is probably not necessary. 
Such mean curves may give some valuable information, depending 
upon the validity of the assumptions one is willing to make concerning 
the general lawfulness of individual behavior. If it is assumed that all 
individuals of a certain class will display the same type of functional 
relation in a given situation, then the mean curve will tell something 
about that function. If, under this assumption, we obtain the mean 
curve described by equation (10), it will be known that the individual 
curves are increasing functions of x and that they either reach a maxi- 
mum or approach an asymptote. 

However, we are not forced to make such an assumption. The 
mean curve of equation (10), for instance, can be obtained even if the 
individual curves are so irregular that they cannot be described by 
any useful equation. A more profitable approach might be to obtain 
and present all data in the form of distributions, to specify the distribu- 
tions by their form and by their parameters, and to relate these distri- 
butions to the independent variables. (Such procedures would, of 
course, apply also to experiments in which the same group of subjects 
is used to determine all points on a function, but where it is observed 
that the individual data are not amenable to a functional description.) 

A third alternative is to develop techniques which will produce 
lawful individual functions and to present the data without averaging. 
Although many methods have already been developed for work with 
individuals [e.g., 1, 5, 6, 7], many more would have to be worked out 
either by devising new measuring techniques or by attaining more 
rigorous control over extraneous variables. Statistical procedures 
would enter into these methods in at least two ways. Replicative 
statistics would be necessary to determine the reliability of the individual 
curves, and information would probably be needed concerning 
the population distribution of the curve constants. Even if these 
methods were highly developed, however, there are still some data (such 
as speed of acquisition of behavior under different motivating condi- 
tions) which will probably never be amenable to individual treatment. 
It will, in such cases, be necessary to forego such data or to use statistical 
analysis. If the latter is done there remains the problem of theoretical 
integration of data obtained by two different procedures. A decision 
among the alternate choices will be made only on the basis of further 
empirical investigation. 
the problem of inference 
from curves based 
on group data 


W. K. Estes 


Papers by Sidman [8], Hayes [4], and Merrill [6] have raised serious 
questions about the validity of inferences from curves of functional 
relationship based on averaged data. By means of mathematical 
arguments and numerical illustrations, these writers have shown 
convincingly that “ . . . given a mean curve, the form of the individual 
curves is not uniquely specified” [8, p. 268]. This demonstration 
strikes close to home for the learning theorist. In the study of learning, 
we are interested in describing behavioral changes in individuals, 
but owing to limited control over behavioral variability must frequently 
depend upon averages for groups of organisms to determine functional 
relationships. In many areas we could scarcely remain in business if it 
were actually true that “...the mean curve does not provide the 
information necessary to make statements concerning the function 
for the individual” [8, p. 268]. Unfortunately it is true. More accu- 
rately, it is true if we regard the mean curve solely as a source of inductive 
generalizations. This qualification suggests that possibly the fault lies, 
not in the averaged curves, but in our customary interpretations of 
them. 

It is noteworthy that learning theory, even quantitative learning 
theory, has made rather steady progress in spite of the widespread 
acceptance of a false methodological assumption. Apparently inferences 
from averaged curves, although not necessarily correct, must in 
fact often be so. This being the case, researchers in learning are unlikely 
to give up readily the habit of computing mean curves of functional 
relationship. My purpose in this note is to show that we need not feel 
obliged to try. The group curve will remain one of our most useful 
devices both for summarizing information and for theoretical analysis 
provided only that it is handled with a modicum of tact and under- 
standing. 

The principal point to be made is that the valid treatment of 
averaged curves depends upon the same principles of statistical 
inference that have become familiar to all of us in such cases as the 
analysis of variance and the chi-square test. Just as any mean score 
for a group of organisms could have arisen from sampling any of an 
infinite variety of populations of scores, so also could any given mean 
curve have arisen from any of an infinite variety of populations of 
individual curves. Therefore no “inductive” inference from mean 
curve to individual curve is possible and the uncritical use of mean 
curves even for such purposes as determining the effect of an experimental 
treatment upon rate of learning or rate of extinction is attended 
by considerable risk. These considerations set rather severe limitations 
upon the use of mean curves in the study of learning. Nonetheless 
we can anticipate that, as so regularly turns out to be the case in 
scientific research, our virtue in accepting these limitations will not 
go unrewarded. The same type of theoretical inquiry that has led to 
recognition of the need for caution in handling averaged data may 
be turned in a constructive direction and lead to more effective 
exploitation of the one defensible and important theoretical application 
that remains for the averaged curve—the testing of exact hypotheses 
about individual functions. 

The first step in this direction is to recognize that the effects of 
averaging are not in any way capricious or unpredictable and need not 
be regarded as artifacts or distortions. Distortion arises only if 
unwarranted inferences are drawn from the mean curves. But given 
any specified assumption about the form of individual functions 
we can proceed to deduce the characteristics to be expected of an 
averaged curve and then to test these predictions against obtained 
data. As in any problem of statistical inference, it will always be true 
that other assumptions might yield the same predictions. The task 
undertaken will be, however, to test, not the infinity of possible 
hypotheses, but only the one hypothesis under consideration. 

In testing quantitative theories against averaged data we may be 
concerned either (a) with the form of a functional relationship or 
(b) with parameter values for the population of organisms sampled. 
Case a is illustrated by the formerly popular pastime of trying to 
determine “the form of the learning curve” or by the attempts to 
verify Hull’s hypothesis that habit strength is an exponential function 
of number of reinforcements [5]. Case b is illustrated by attempts to 
determine whether the slope parameter of the habit growth curve 
depends upon amount of reinforcement [11] or whether the rate and 
asymptote of maze learning are functions of stimulus variability [9]. 
In studies involving Case a, it has been customary to operate on the 
tacit assumption that the form of a mean curve will reflect faithfully 
the form of the individual curves. Since this assumption is now 
recognized to be unwarranted, we can no longer expect averaged data 
to yield any direct answer to the question, “What is the form of the 
individual function?” We can, however, replace this question with one 
which can be answered, namely, “Is the form of the mean empirical 
curve in accord with the assumption that the individual functions are 
of a given form, say y = f(x, a, b, . . . )?” (In the remainder of the discussion 
we shall represent by f the function relating a dependent variable y 
to an independent variable x and parameters a, b, etc.) It becomes a 
specific mathematical or statistical research problem to determine 
for any given function f what testable predictions can be made 
concerning the mean curve for a group of organisms. Some pre- 
liminary considerations that may be helpful in dealing with this type of 
problem will be discussed below. 

In studies involving Case b the assumption has frequently been 
made that if the function obtained for the individual organism is 
y = f(x, a, b, . . . ), then the function describing the mean curve for a 
group of organisms should be y = f(x, ā, b̄, . . . ), i.e., a curve of the 
same form with parameters equal to the means of the corresponding 
individual parameters. Since the assumption is not generally true, 
the treatment of this case will require, first, recognizing the instances in 
which the assumption holds, and, second, investigating instances in 
which it does not hold in order to determine what information about 
parameter values is obtainable from the mean curve. 


Classification of functions 


Relative to these problems, the mathematical functions that we will 
have occasion to deal with can be classified into three types, each 
calling for somewhat different treatment. Let us consider briefly the 
problems that will arise in dealing with each of these types and illustrate 
some of the procedures that will prove useful in dealing with them. 
Class A. Functions 
unmodified by averaging 


In these cases the mean curve for the group has the form of the individual 
function and the parameters of the mean curve are simply the means of 
the corresponding individual parameters. The chief problem here is 
that of defining the class of functions so that we will recognize instances 
of it. The essential characteristics of the class will be apparent from 
consideration of a few examples: 


1. y = a + bx 

2. y = a + bx + cx² 

3. y = a log x 

4. y = a sin x + b cos x 

5. y = ax. 


A numerical illustration involving one of these examples will 
show in a concrete way how the averaging process works out for this 
type of function. Suppose that we have two organisms whose behavior 
in a learning situation is described by the function y = a log x, where 
a is a constant which varies in value from one organism to another, 
but remains fixed in value throughout learning for any one organism. 
Let y_1 and y_2 be response measures for the two organisms, and let the 
value of a be 1 for the first organism and 2 for the second. Then 
the course of learning for the two organisms will be described by the 
equations 

$$ y_1 = \log x $$

and 

$$ y_2 = 2 \log x, $$

respectively. 

Now we compute the “empirical” response measures for each organism 
for the first four values of the independent variable x as indicated in 
Table 1. Then by averaging the two response measures at each value of 
x, we obtain the mean “empirical” curve represented by the values in 
the column headed ȳ. It is clear, however, that the column of mean 
values also represents the values of the function y = 1.5 log x. Therefore 
the function describing the mean curve is of the same form as the 
individual functions, and the parameter of the function describing 
the mean curve is the mean of the individual parameters. 

Table 1 
Effect of averaging a simple logarithmic function 

x    log x    y_1     y_2     ȳ      1.5 log x 
1    .00      .00     .00     .00    .00 
2    .30      .30     .60     .45    .45 
3    .48      .48     .96     .72    .72 
4    .60      .60     1.20    .90    .90 

All functions belonging to this class work out similarly.¹ Stated in 
the simplest terms, what they all have in common is that each parameter 
in the function appears either alone or as a coefficient multiplying 
a quantity which depends only on the independent variable x. In 
averaging, any quantity of the latter sort factors out at each value of x 
and appears in the mean curve, multiplying the mean value of the 
parameter. 

1. See Mathematical Note 1. 
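
The arithmetic of Table 1 can be reproduced in a few lines of Python. This is a sketch only: the variable names are illustrative, and logarithms are taken to base 10 because that is what the tabled values (.30 for x = 2, and so on) imply.

```python
import math

a_values = [1.0, 2.0]   # the two organisms' values of a, as in the text
xs = [1, 2, 3, 4]       # the first four values of x

def individual(x, a):
    """Individual function y = a log x (base-10 logarithm, per Table 1)."""
    return a * math.log10(x)

a_bar = sum(a_values) / len(a_values)

# Mean "empirical" curve: average the two response measures at each x.
mean_curve = [sum(individual(x, a) for a in a_values) / len(a_values)
              for x in xs]

# Curve of the same form evaluated at the mean parameter, 1.5 log x.
same_form = [individual(x, a_bar) for x in xs]

for x, m, s in zip(xs, mean_curve, same_form):
    print(x, round(m, 2), round(s, 2))
```

Because log x factors out of the sum at each x, the two columns coincide for any set of a values, not only for the pair used in the table.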


Class B. Functions for which 
averaging complicates the 
interpretation of parameters 
but leaves form unchanged 


Examples of functions falling in this class² are 

1. y = log bx 

2. y = 1/a + b/(ax). 

In the first example, we can rewrite the function in the form 

$$ y = \log b + \log x; $$

then it is apparent that the mean curve for a group of organisms which 
differ with respect to parameter b will be logarithmic in form, for the 
same reasons discussed in the preceding section, but will have the 
mean value of log b rather than the logarithm of the mean of b as the 
intercept constant. Thus from a mean empirical curve, we can obtain an 
estimate of the geometric mean of the parameter b for the organisms 
sampled, but no estimate of the arithmetic mean of b. 

In the second example, the mean curve of y vs. 1/x will be linear, 
but the parameters of the mean curve will be the mean values of 1/a 
and b/a for the organisms sampled, so no estimate of ā or b̄ can be 
obtained from the averaged data. 

2. See Mathematical Note 2. 
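
The first example can be illustrated numerically. In the Python sketch below the values of b are hypothetical and base-10 logarithms are an arbitrary choice. The intercept of the mean curve is recovered by subtracting log x from the mean response at each x, and the antilog of that intercept is compared with the geometric and arithmetic means of b.

```python
import math
from statistics import mean, geometric_mean

b_values = [1.0, 10.0, 100.0]   # hypothetical individual values of b
xs = [1, 2, 3, 4]

def individual(x, b):
    """Individual function y = log(bx), base-10 logarithm."""
    return math.log10(b * x)

# Mean curve across organisms at each x.
mean_curve = [mean(individual(x, b) for b in b_values) for x in xs]

# The mean curve is (mean of log b) + log x, so subtracting log x
# leaves the same intercept at every x.
intercepts = [y - math.log10(x) for y, x in zip(mean_curve, xs)]
b_from_mean_curve = 10 ** intercepts[0]

print(intercepts)
print(b_from_mean_curve, geometric_mean(b_values), mean(b_values))
```

The value recovered from the mean curve matches the geometric mean of b and is far from the arithmetic mean, which is why the mean curve supports an estimate of the former but not the latter.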

The testing of hypotheses involving functions in this class raises no 
difficulties if we are interested only in the form of the function; if we 
wish to estimate parameter values or to test hypotheses involving 
changes in parameter values as a function of experimental treatments, 
then care must be taken to allow for the effects of averaging. 


Class C. Functions 
modified in form by 
averaging 


A function will fall in this class³ if it contains any terms involving the 
independent variable x which will not factor out when we sum values 
of y over a group of organisms for a constant value of x. The most 
familiar example of a function belonging to this class is the “growth” 
curve 

$$ y = a + b e^{-cx} $$

encountered in some guise or other in many learning theories, and 
given detailed discussion in Sidman’s paper [8]. 

3. See Mathematical Note 3. 

In some cases, a function belonging to this class can be moved 
into Class B or even Class A by means of an appropriate transformation. 
Take, for example, the exponential function given above. If the value 
of the parameter a is known for all individuals, it can be subtracted 
from the response measure y, leaving us with the simpler equation 

$$ y' = y - a = b e^{-cx}. $$

The latter can be made more tractable by the logarithmic transformation 

$$ \log y' = \log b - cx, $$

which when averaged yields 

$$ E(\log y') = E(\log b) - \bar{c}x, $$

where E( ) represents the mean, or expected, value of the term in 
parentheses. If, then, we take logarithms (base e) of the dependent 
variable y’ and plot the transformed variable as a function of x, both 
the curve for any individual and the averaged curve for a group will be 
linear; from the mean curve we can obtain estimates of the mean 
value of the parameter c and of the geometric mean of the parameter b. 
By means of this stratagem the problem of testing the hypothesis 
that an exponential function holds for individual organisms has been 
reduced to the very simple problem of determining whether the mean 
curve plotted from the transformed data departs significantly from 
linearity. Similarly, other hypotheses that might be tested against 
the group data are greatly simplified. Suppose, for example, that a 
theoretical curve of extinction took the form of this exponential 
function, with y being a response measure, x number of trials, and the 
asymptote a equal to zero, and that we were interested in the question 
whether some difference in the experimental treatments given two 
groups of organisms influenced rate of extinction; by means of the 
suggested transformation, this problem would reduce to that of testing 
for a difference in slope between two regression lines. A variety of 
transformations which may be useful in situations of this sort have been 
discussed by Mueller [7]. 
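
The effect of the logarithmic transformation can be sketched in Python. The (b, c) pairs below are hypothetical, and second differences over equally spaced x are used here as a simple stand-in for a formal test of departure from linearity. The sketch contrasts two orders of operation: averaging the transformed individual values, which is what the text recommends, and transforming the averaged values.

```python
import math

# Hypothetical (b, c) pairs for two organisms, with y' = b * exp(-c * x).
params = [(2.0, 0.1), (8.0, 0.5)]
xs = [0, 1, 2, 3, 4]

def y_prime(x, b, c):
    return b * math.exp(-c * x)

def mean_of_logs(x):
    """Average the log-transformed individual values at x."""
    return sum(math.log(y_prime(x, b, c)) for b, c in params) / len(params)

def log_of_mean(x):
    """Log-transform the averaged raw values at x."""
    return math.log(sum(y_prime(x, b, c) for b, c in params) / len(params))

def second_differences(values):
    """Zero everywhere when equally spaced values lie on a straight line."""
    return [values[i + 2] - 2 * values[i + 1] + values[i]
            for i in range(len(values) - 2)]

line = [mean_of_logs(x) for x in xs]
curve = [log_of_mean(x) for x in xs]

slope = (line[-1] - line[0]) / (xs[-1] - xs[0])   # estimates minus mean c
b_geometric = math.exp(line[0])                   # x = 0 gives mean log b

print(second_differences(line))
print(second_differences(curve))
print(slope, b_geometric)
```

The averaged transformed values fall on a straight line whose slope is minus the mean of c and whose intercept gives the geometric mean of b. The logarithm of the averaged raw values is visibly curved, so the transformation must be applied to each individual's data before averaging.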

Even when functions in Class C cannot be moved into one of the 
more docile classes by any available transformation, or when for some 
reason transformation of the data is undesirable (as might be the case 
if a contemplated transformation produced heterogeneity of variances 
along the curve), we are not necessarily helpless. The extent to which 
functional form is modified by averaging will generally depend upon 
the dispersion of parameter values in the group of organisms sampled; 
thus in some cases it may be possible by studying individual curves to 
estimate the dispersion of parameter values in the group and determine 
whether the form of the mean curve can be expected to conform closely 
to the form of the individual functions; see, e.g., [3]. Further, even 
in the case of the most refractory functions, it will usually be possible by 
appropriate mathematical analysis to derive the main characteristics 
that should be predicted for an averaged curve; an analysis of this sort 
for a “growth” function has been described in a recent paper [2]. 


The role of 
experimental error 


The analysis given here might be objected to on the grounds that we 
have considered only the effects of averaging upon data obtained from 
idealized organisms which behave strictly in accordance with theoretical 
functions. Response measures obtained from real organisms may, on 
the other hand, be influenced by various sources of experimental error 
as well as by the variables taken account of in a given theory. The 
objection is pertinent, but not fatal. The answer is that in testing a 
theoretical prediction one must make some explicit assumption about 
the role of experimental error in the test situation. And as in any 
statistical test, the validity of the conclusions will be conditional upon 
the degree to which such assumptions are satisfied. In some instances, 
it may be reasonable to assume that the contribution of experimental 
error is negligible; then the analyses given above will apply without 
modification. Frequently it will be more reasonable to operate under 
the assumption, routinely made in working with analysis-of-variance 
models, that error combines additively with treatment effects to 
determine the observed response measures. In this case, if we wish to 
test the hypothesis that a function y = f(x, a, b,...) holds for indi- 
viduals, we will assume that the observed response measure Y for any 
individual is equal to the sum of y and a random variable e which 
represents the contribution of experimental error, i.e., 


Se Teoh (Oy Gb aA e 


Now if the error variable e is independent of x, and if the function f
falls in our Class A, averaging of individual curves will yield a mean
curve described by the function


$$\bar{Y} = f(x, \bar{a}, \bar{b}, \ldots) + \bar{e}$$


If the mean value of e is zero, which will, for example, be the case
whenever the distribution of errors is normal, then the form of the
mean curve will be unaffected by the error term; if the mean is not
zero, then the mean function will be modified only by the addition of a
constant and the plotted mean curve will be changed only by a vertical
displacement. In some cases the error variable may interact with
experimental variables. If the nature of the interaction can be stated
explicitly, then its effects upon the averaging process can be determined
by appropriate analysis. In situations where error variables and
experimental variables interact in complex or unknown ways, exact 
tests of quantitative hypotheses will generally be impossible. 
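
The additive-error argument can be checked by simulation. The sketch below is illustrative only: it assumes the Class A function y = a log x discussed in this paper, made-up parameter values for a hypothetical group of individuals, and simulated normal errors drawn independently of x. None of the numbers come from the source.

```python
import math
import random
from statistics import mean


def individual_curve(x, a):
    """Class A example: y = a * log(x)."""
    return a * math.log(x)


def mean_observed_curve(xs, a_values, error_mean, error_sd, rng):
    """Average Y = y + e over individuals at each x, with e independent of x."""
    curve = []
    for x in xs:
        observed = [individual_curve(x, a) + rng.gauss(error_mean, error_sd)
                    for a in a_values]
        curve.append(mean(observed))
    return curve


rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [2, 4, 8, 16]
# Hypothetical individual parameter values (assumed, not data).
a_values = [rng.uniform(0.5, 1.5) for _ in range(20000)]
a_bar = mean(a_values)

# Curve predicted for the group mean in the absence of error.
theory = [individual_curve(x, a_bar) for x in xs]

zero_mean = mean_observed_curve(xs, a_values, 0.0, 1.0, rng)
shifted = mean_observed_curve(xs, a_values, 0.5, 1.0, rng)

for x, t, z, s in zip(xs, theory, zero_mean, shifted):
    # Columns: x, error-free prediction, mean curve with zero-mean error,
    # and vertical displacement produced by a nonzero error mean.
    print(x, round(t, 3), round(z, 3), round(s - t, 3))
```

With zero-mean error the simulated mean curve stays close to the error-free prediction at every x; with an error mean of 0.5 the whole curve is displaced upward by roughly that constant, leaving its form unchanged.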


Summary 


These comments are not meant to provide an exhaustive treatment of
the problem of averaging. The one point I have tried to bring out
clearly is that the valid interpretation of group curves* depends on the
principles common to all problems of statistical inference. Although
the form of a group mean curve does not determine the forms of the 
individual curves, it does provide a means of testing exact hypotheses 
about them. In each particular case, the procedure must be to state 
explicitly the hypothesis under test, and then to derive the properties 
that should hold for the averaged curve if the hypothesis is correct. 
If the predictions thus derived are in accord with data, the hypothesis 
remains tenable; if they are not, then the hypothesis can be rejected at 
some specified level of confidence. Utilized within this framework, the 
averaged curve can be expected to remain one of the most valuable 
techniques for the analysis of behavioral data, and in fact to increase 
progressively in value as mathematical and statistical research continues 
to enlarge our repertory of special devices for the handling of particular 
problems. 




Mathematical notes 


1. A more formal criterion for class inclusion is desirable for some
purposes, and may be formulated as follows.* Let us consider a function
y = f(x, a, b, ...). At any given value of x, we may regard y as a
function of the parameters a, b, etc., and expand the function in a
Taylor's series around the mean values of the parameters [6, 10],
obtaining the relation

$$y = f(x, \bar{a}, \bar{b}, \ldots) + (\Delta a) f_a + (\Delta b) f_b + \cdots + \frac{(\Delta a)^2}{2} f_a^{(2)} + \cdots,$$

where $\bar{a} + \Delta a$ is the value of the a parameter for a given organism;
$f_a^{(i)}$ represents the ith derivative of y with respect to a, evaluated at
$a = \bar{a}$ (with $f_a$ denoting the first derivative); and so on. When the
function is averaged over a group of individuals, the first-order terms
drop out because the mean of each deviation is zero, and we obtain

$$\bar{y} = f(x, \bar{a}, \bar{b}, \ldots) + \tfrac{1}{2}\sigma_a^2 f_a^{(2)} + \tfrac{1}{2}\sigma_b^2 f_b^{(2)} + \cdots.$$

Our criterion for inclusion of a function in Class A may now be stated:
if in the Taylor's series development, all second and higher order
partial derivatives of the function with respect to parameters are zero,
then the function is unmodified by averaging. Applying the criterion
to y = a log x, we have $f_a = \log x$; $f_a^{(2)} = 0$; and therefore
$\bar{y} = \bar{a} \log x$, in agreement with the conclusion reached above by a
more informal route.


2. A sufficient criterion for inclusion of a function y = f(x, a, b, ...)
in Class B is that it does not satisfy the criterion of Class A when
expanded around $\bar{a}$, $\bar{b}$, etc., but does satisfy that criterion when
rewritten y = f(x, u, v, ...) and expanded around $\bar{u}$, $\bar{v}$, etc. (u, v, etc.
being functions of the parameters a, b, ...). In the first example under
Class B above, this criterion is satisfied if we let log b = u; in the second
example, it is satisfied if we let 1/a = u and b/a = v.


3. If a function falls in Class C, then in the Taylor's series develop-
ments described above, some of the second or higher order derivatives
will depend on x regardless of how u, v, etc. are chosen, and thus the
criteria for Class A or Class B cannot be satisfied.

It will be noted that these formal criteria provide more rigorous
definitions of the various classes than can be given in nonmathematical
terms. However, it should be emphasized that the conclusions about
inference from averaged curves that we have reached in this paper do
not depend on abstruse mathematical analyses. In many practical
situations, questions concerning the effects of averaging can be handled
by simple numerical methods of the type illustrated in an earlier section.


Footnote 4. Throughout this discussion we have spoken in terms of mean
curves obtained from groups of organisms. Similar problems arise, and similar
considerations apply, however, in the case of a curve whose points represent
means of repeated measures on the same organism. Parameter values
associated with an individual organism may vary either systematically or
randomly during the course of an experiment. In either case, we may think
of each possible combination of parameter values as determining a hypo-
thetical curve, this population of curves being sampled at each value of the
independent variable. Whether the obtained mean curve should be
expected to have the same form as the hypothetical individual curves will
depend on the nature of the mathematical function describing the latter
and on the role of experimental error, just as in the case of a group curve.

Footnote 5. A criterion proposed by Bakan [1], which involves expanding
the function in a Maclaurin series around the point x = 0, is not entirely
satisfactory. For one thing it is frequently inapplicable. Take, for example,
the functions y = a log x or y = x^a; in neither case are the derivatives all
continuous at x = 0, so in neither case will the series generally represent the
function. The criterion suggested in the present paper will hold for all
functions which can be expanded by Taylor's theorem, a class which includes
all the elementary functions and, in fact, all explicit functions that the
psychologist is apt to have dealings with.


N = 1


William F. Dukes

From Psychological Bulletin, Vol. 64 (No. 1), 1965, pp. 74-79. Copyright 1965
by the American Psychological Association. Reproduced by permission.


In the search for principles which govern behavior, psychologists
generally confine their empirical observations to a relatively small
sample of a defined population, using probability theory to help assess
the generality of the findings obtained. Because this inductive process
commonly entails some knowledge of individual differences in the
behavior involved, studies employing only one subject (N = 1) seem
somewhat anomalous. With no information about intersubject
variability in performance, the general applicability of findings is
indeterminate.

Although generalizations about behavior rest equally upon
adequate sampling of both subjects and situations, questions about
sampling most often refer to subjects. Accordingly, the term “N = 1”
is used throughout the present discussion to designate the reductio ad
absurdum in the sampling of subjects. It might, however, equally well
(perhaps better, in terms of frequency of occurrence) refer to the
limiting case in the sampling of situations, for example, the use of one
maze in an investigation of learning, or a simple tapping task in a study
of motivation. With respect to the two samplings, Brunswik (1956),
foremost champion of the representative design of experiments,
speculated:


In fact, proper sampling of situations and problems may in the end be 
more important than proper sampling of subjects, considering the fact 
that individuals are probably on the whole much more alike than are 
situations among one another [p. 39]. 


As a corollary, the term N = 1 might also be appropriately 
applied to the sampling of experimenters. Long recognized as a 
potential source of variance in interview data (e.g., Cantril, 1944;
Katz, 1942), the investigator has recently been viewed as a variable
which may also influence laboratory results (e.g., McGuigan, 1963;
Rosenthal, 1963). 

Except to note these other possible usages of the term N = 1, 
the present paper is not concerned with one-experimenter or one- 
situation treatments, but is devoted, as indicated previously, to 
single-subject studies. 

Despite the limitation stated in the first paragraph, N = 1 studies 
cannot be dismissed as inconsequential. A brief scanning of general 
and historical accounts of psychology will dispel any doubts about 
their importance, revealing, as it does, many instances of pivotal
research in which the observations were confined to the behavior of
only one person or animal.


Selective historical review 


Foremost among N = 1 studies is Ebbinghaus’ (1885) investigation of 
memory. Called by some authorities “a landmark in the history of 
psychology ...a model which will repay careful study [McGeoch 
and Irion, 1952, p. 1],” considered by others “a remedy .. . at least 
as bad as the disease [Bartlett, 1932, p. 3],” Ebbinghaus’ work 
established the pattern for much of the research on verbal learning
during the past 80 years. His principal findings, gleaned from many
self-administered learning situations consisting of some 2,000 lists
of nonsense syllables and 42 stanzas of poetry, are still valid source
material for the student of memory. In another well-known pioneering
study of learning, Bryan and Harter’s (1899) report on plateaus,
certain crucial data were obtained from only one subject. Their
letter-word-phrase analysis of learning to receive code was based on
the record of only one student. Their notion of habit hierarchies
derived in part from this analysis is, nevertheless, still useful in explaining
why plateaus may occur.

Familiar even to beginning students of perception is Stratton’s
(1897) account of the confusion from and the adjustment to wearing
inverted lenses. In this experiment according to Boring (1942),
Stratton, with only himself as subject,


settled both Kepler’s problem of erect vision with an inverted image, and 
Lotze’s problem of the role of experience in space perception, by showing 
that the “absolute” localization of retinal positions—up-down and 
right-left—are learned and consist of bodily orientation as context to 
the place of visual excitation [p. 237]. 


The role of experience was also under scrutiny in the Kelloggs’ 
(1933) project of raising one young chimpanzee, Gua, in their home. 
(Although observations of their son’s behavior were also included in 
their report, the study is essentially of the N = 1 type, since the 
“experimental group” consisted of one.) This attempt to determine 
whether early experience may modify behavior traditionally regarded 
as instinctive was for years a standard reference in discussions of the 
learning-maturation question.

Focal in the area of motivation is the balloon-swallowing experi- 
ment of physiologists Cannon and Washburn (1912) in which kymo- 
graphic recordings of Washburn’s stomach contractions were shown 
to coincide with his introspective reports of hunger pangs. Their 
findings were widely incorporated into psychology textbooks as 
providing an explanation of hunger. Even though in recent years
greater importance has been attached to central factors in hunger,
Cannon and Washburn’s work continues to occupy a prominent place
in textbook accounts of food-seeking behavior.

In the literature on emotion, Watson and Rayner’s study (1920)
of Albert’s being conditioned to fear a white rat has been hailed as
“one of the most influential papers in the history of American psychol-
ogy” [Miller, 1960, p. 690]. Their experiment, Murphy (1949) observes,


immediately had a profound effect on American psychology; for it
appeared to support the whole conception that not only simple motor
habits, but important, enduring traits of personality, such as emotional
tendencies, may in fact be ‘built into’ the child by conditioning [p. 201].


Actually the Albert experiment was unfinished because he moved
away from the laboratory area before the question of fear removal
could be explored. But Jones (1924) provided the natural sequel in
Peter, a child who, through a process of active reconditioning, overcame
a nonlaboratory-produced fear of white furry objects.

In abnormal psychology few cases have attracted as much atten-
tion as Prince’s (1905) Miss Beauchamp, for years the model case in
accounts of multiple personality. An excerpt from the Beauchamp
case was recently included, along with selections from Wundt, James,
Pavlov, Watson, and others, in a volume of 36 classics in psychology
(Shipley, 1961). Perhaps less familiar to the general student but more 
significant in the history of psychology is Breuer’s case (Breuer and 
Freud, 1895) of Anna O., the analysis of which is credited with con- 
taining “the kernel of a new system of treatment, and indeed a new 
system of psychology [Murphy, 1949, p. 307].” In the process of 
examining Anna’s hysterical symptoms, the occasions for their appear- 
ance, and their origin, Breuer claimed that with the aid of hypnosis 
these symptoms were “talked away.” Breuer’s young colleague was 
Sigmund Freud (1910), who later publicly declared the importance 
of this case in the genesis of psychoanalysis. 

There are other instances, maybe not so spectacular as the pre- 
ceding, of influential N = 1 studies—for example, Yerkes’ (1927) 
exploration of the gorilla Congo’s mental activities; Jacobson’s 
(1931) study of neuromuscular activity and thinking in an amputee; 
Culler and Mettler’s (1934) demonstration of simple conditioning in a 
decorticate dog; and Burtt’s (1932) striking illustration of his son’s 
residual memory of early childhood. 

Further documentation of the significant role of N = 1 research in 
psychological history seems unnecessary. A few studies, each in
impact like the single pebble which starts an avalanche, have been the 
impetus for major developments in research and theory. Others, 
more like missing pieces from nearly finished jigsaw puzzles, have 
provided timely data on various controversies. 

This historical recounting of “successful” cases is, of course, not 
an exhortation for restricted subject samplings, nor does it imply that 
their greatness is independent of subsequent related work. 


Frequency and
range of topics

In spite of the dated character of the citations—the latest being 
1934—N = 1 studies cannot be declared the product of an era un- 
sophisticated in sampling statistics, too infrequent in recent psychology 
to merit attention. During the past 25 years (1939-1963) a total of 
246 N = 1 studies, 35 of them in the last 5-year period, have appeared 
in the following psychological periodicals: the American Journal of 
Psychology, Journal of Genetic Psychology, Journal of Abnormal 
and Social Psychology, Journal of Educational Psychology, Journal 
of Comparative and Physiological Psychology, Journal of Experimental
Psychology, Journal of Applied Psychology, Journal of General Psy-
chology, Journal of Social Psychology, Journal of Personality, and
Journal of Psychology. These are the journals, used by Bruner and
Allport (1940) in their survey of 50 years of change in American 
psychology, selected as significant for and devoted to the advancement 
of psychology as science. (Also used in their survey were the Psycho- 
logical Review, Psychological Bulletin, and Psychometrika, excluded 
here because they do not ordinarily publish original empirical work.) 
Although these 246 studies constitute only a small percent of the 
1939-1963 journal articles, the absolute number is noteworthy and 
is sizable enough to discount any notion that N = 1 studies are a 
phenomenon of the past.

When, furthermore, these are distributed, as in Table 1, according 
to subject matter, they are seen to coextend fairly well with the range 
of topics in general psychology. As might be expected, a large propor- 
tion of them fall into the clinical and personality areas. One cannot, 
however, explain away N = 1 studies as case histories contributed by 
clinicians and personologists occupied less with establishing generaliza- 
tions than with exploring the uniqueness of an individual and under- 
standing his total personality. Only about 30% (74) are primarily 
oriented toward the individual, a figure which includes not only 
works in the “understanding” tradition, but also those treating the 
individual as a universe of responses and applying traditionally 
nomothetic techniques to describe and predict individual behavior 
(e.g., Cattell and Cross, 1952; Yates, 1958).

In actual practice, of course, the two orientations, toward
uniqueness or generality, are more a matter of degree than of mutual
exclusion, with the result that in the literature surveyed purely idio-
graphic research is extremely rare. Representative of that approach
are Evans’ (1950) novel-like account of Miller who “spontaneously”
recovered his sight after more than 2 years of blindness, Rosen’s
(1949) “George X: A self-analysis by an avowed fascist,” and McCurdy’s
(1944) profile of Keats.


Rationale for N = 1 


The appropriateness of restricting an idiographic study to one indi-
vidual is obvious from the meaning of the term. If uniqueness is involved,
a sample of one exhausts the population. At the other extreme, an
N of 1 is also appropriate if complete population generality exists
(or can reasonably be assumed to exist). That is, when between-
individual variability for the function under scrutiny is known to be
negligible or the data from the single subject have a point-for-point
congruence with those obtained from dependable collateral sources,
results from a second subject may be considered redundant. Some
N = 1 studies may be regarded as approximations of this ideal case,
as for example, Heinemann’s (1961) photographic measurement of
retinal images and Bartley and Seibel’s (1954) study of entoptic stray
light, using the flicker method.


Table 1

Total distribution of N = 1 studies (1939-1963)

Category                        F     Examples

Maturation, development         29    Sequential development of prehension in a
                                      macaque (Jensen, 1961); smiling in a human
                                      infant (Salzen, 1963)
Motivation                      7     Differential reinforcement effects of true,
                                      esophageal, and sham feeding in a dog (Hull,
                                      Livingston, Rouse, and Barker, 1951)
Emotion                         12    Anxiety levels associated with bombing
                                      (Glavis, 1946)
Perception, sensory processes   25    Congenital insensitivity to pain in a 19-year-
                                      old girl (Cohen, Kipnis, Kunkle, and
                                      Kubzansky, 1955); figural aftereffects with a
                                      stabilized retinal image (Krauskopf, 1960)
Learning                        27    Delayed recall after 50 years (Smith, 1963);
                                      imitation in a chimpanzee (Hayes and Hayes,
                                      1952)
Thinking, language              15    “Idealess” behavior in a chimpanzee (Razran,
                                      1961); opposite speech in a schizophrenic
                                      patient (Laffal and Ameen, 1959)
Intelligence                    14    Well-adjusted congenital hydrocephalic with
                                      IQ of 113 (Teska, 1947); intelligence after
                                      lobectomy in an epileptic (Hebb, 1939)
Personality                     51    Keats’ personality from his poetry (McCurdy,
                                      1944); comparison in an adult of P and R
                                      techniques (Cattell and Cross, 1952)
Mental health, psychotherapy    66    Multiple personality (Thigpen and Cleckley,
                                      1954); massed practice as therapy for patient
                                      with tics (Yates, 1958)

Total                           246

A variant on this typicality theme occurs when the researcher, in
order to preserve some kind of functional unity and perhaps to
dramatize a point, reports in depth one case which exemplifies many.
Thus Eisen’s (1962) description of the effects of early sensory depriva-
tion is an account of one quondam hard-of-hearing child, and Bettel-
heim’s (1949) paper on rehabilitation a chronicle of one seriously
delinquent child.

In other studies an N of 1 is adequate because of the dissonant 
character of the findings. In contrast to its limited usefulness in
establishing generalizations from “positive” evidence, an N of 1, when
the evidence is “negative,” is as useful as an N of 1,000 in rejecting an
asserted or assumed universal relationship. Thus Krauskopf’s (1960)
demonstration with one stopped-image subject eliminates motion of
the retinal image as necessary for figural aftereffects; and Lenneberg’s
(1962) case of an 8-year-old boy who lacked the motor skills necessary 
for speaking but who could understand language makes it “clear that 
hearing oneself babble is not a necessary factor in the acquisition of 
understanding . . . [p. 422].” Similarly Teska’s (1947) case of a 
congenital hydrocephalic, 64 years old, with an IQ of 113, is sufficient 
evidence to discount the notion that prolonged congenital hydro- 
cephaly results in some degree of feeblemindedness. 
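
The asymmetry between “positive” and “negative” evidence can be put in a few lines of code. The sketch below only illustrates the logic; the function names and the records are invented for the purpose and are not data from any of the studies cited.

```python
def universal_claim_holds(cases, claim):
    """A universal claim survives only if every observed case satisfies it."""
    return all(claim(case) for case in cases)


def condition_implies_impairment(case):
    """Hypothetical universal claim: every case with the condition is impaired."""
    has_condition, impaired = case
    return (not has_condition) or impaired


# Invented records of the form (has_condition, impaired).
confirming = [(True, True)] * 1000
single_counterexample = [(True, False)]

print(universal_claim_holds(confirming, condition_implies_impairment))
print(universal_claim_holds(confirming + single_counterexample,
                            condition_implies_impairment))
```

A thousand confirming cases leave the claim standing without proving it, whereas the single contrary case is enough to reject it.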

While scientists are in the long run more likely to be interested in 
knowing what is than what is not and more concerned with how many 
exist or in what proportion they exist than with the fact that at least one 
exists, one negative case can make it necessary to revise a traditionally 
accepted hypothesis.

Still other N = 1 investigations simply reflect a limited opportunity
to observe. When the search for lawfulness is extended to infrequent
“nonlaboratory” behavior, individuals in the population under study
may be so sparsely distributed spatially or temporally that the psy-
chologist can observe only one case, a report of which may be useful as a
part of a cumulative record. Examples of this include cases of multiple
personality (Thigpen and Cleckley, 1954), unilateral color blindness
(Graham, Sperling, Hsia, and Coulson, 1961), congenital insensitivity
[...]
as well as subject sparsity may limit the opportunity to observe.
When the situation is greatly extended in time, requires expensive or
specialized training for the subject, or entails intricate and difficult to
administer controls, the investigator may, aware of their exploratory
character, restrict his observations to one subject. Projects involving
home-raising a chimpanzee (Hayes and Hayes, 1952) or [...]
1941), would seem to illustrate this use of an N of 1.

Not all N = 1 studies can be conveniently fitted into this rubric;
nor is this necessary. Instead of being oriented either toward the
person (uniqueness) or toward a global theory (universality), researchers
may sometimes simply focus on a problem. Problem-centered research 
on only one subject may, by clarifying questions, defining variables, 
and indicating approaches, make substantial contributions to the 
study of behavior. Besides answering a specific question, it may 
(Ebbinghaus’ work, 1885, being a classic example) provide important
groundwork for the theorists.

Regardless of rationale and despite obvious limitations, the 
usefulness of N = 1 studies in psychological research seems, from the 
preceding historical and methodological considerations, to be fairly 
well established. (See Shapiro, 1961, for an affirmation of the value 
of single-case investigations in fundamental clinical psychological 
research.) Finally, their status in research is further secured by the 
statistician’s assertion (McNemar, 1940) that: 


The statistician who fails to see that important generalizations from 
research on a single case can ever be acceptable is on a par with the 
experimentalist who fails to appreciate the fact that some problems can 
never be solved without resort to numbers [p. 361].


The final block of readings deals with the role of statistics in psychology,
with null hypothesis testing, and with tests of significance. In the
article by Nunnally there is a description of what the author believes
to be misconceptions about the use of statistical methods. Nunnally
discusses what he terms some psychologists’ “preoccupation with
statistics” and the hypothesis testing model for finding “significant
differences.” The author reminds us that the task of psychologists is
to discover lawful relations in behavior, not significant differences.
In the next article, by Tukey, a distinction is made between conclusions
and decisions. Also included in the article is a discussion of tests of
significance, tests of hypotheses, and point estimates.

Since null hypothesis testing has been of central concern to many
psychologists, a number of articles are included in this block which
critically evaluate the strengths and weaknesses of this procedure.
Rozeboom’s article contains serious objections to the “traditional
null-hypothesis significance-test method.” Among other things, he
is critical of what he calls “decisions to accept or to reject hypotheses,”
and he would prefer to use “degrees of believing or disbelieving.”
At the conclusion of his article a number of suggestions are found for
strengthening our data evaluation procedures.

The article by Bakan contains a further discussion of the test of
significance. Bakan takes the null hypothesis to task and points out
that it is generally false under any circumstances. In addition, the
article contains a review of both the Fisher and the Bayesian approaches
as a basis for inference, and concludes with the suggestion that the
Bayesian approach may be the more appropriate of the two. A critique
of the emphasis placed on statistical significance is also found in the
article by Lykken. Lykken’s conclusion is that statistical significance
may be the least important characteristic of good research. The point
is made that other things determine the value of research, such as the
degree of experimental control, the measuring techniques used, and
the scientific or practical importance of the phenomenon. Lykken also
emphasizes the importance of replication and distinguishes among
several kinds.

The differences in theory testing between psychology and physics
are discussed in Meehl’s article. Meehl points out that the role of
statistical significance in psychology is the reverse of that in physics.
In addition to discussing the logic and methodology of science, Meehl 
presents a brief review of the process of statistical inference. 

Several of the articles deal specifically with the problems associated 
with testing the null hypothesis and provide a series of arguments and 
counterarguments. Grant argues that accepting the null hypothesis is 
an inappropriate way of seeking support for a theory. Binder makes 
rebuttal to this argument, and the argument is carried even further 
by Wilson and Miller. The latter two authors tend to agree with Grant 
in that they also argue that rejection is better than acceptance of the 
null hypothesis. Disagreement with the rejection-support position is 
found in the article by Edwards. Edwards notes that classical statis- 
tics—in contrast to Bayesian statistics—is strongly biased against the 
null hypothesis. He feels that a conservative investigator should 
identify his theory with the null hypothesis. Several of the points made 
in the article by Edwards are challenged in the paper by Wilson, 
Miller, and Lower. 


13 


the place of statistics 
in psychology 


Jum Nunnally

From Educational and Psychological Measurement, Vol. 20 (No. 4), 1960,
pp. 641-650. Reprinted by permission of J. Nunnally and Educational and
Psychological Measurement.


Most psychologists probably will agree that the emphasis on statistical
methods in psychology is a healthy sign. Although we sometimes
substitute statistical elegance for good ideas, and over-embellish small
studies with elaborate analyses, we are probably on a firmer basis than
we were in the prestatistical days. However, it will be argued that there
are some serious misemphases in our use of statistical methods, which
are retarding the growth of psychology.

The purpose of this article is to criticize the use of statistical
“hypothesis-testing” models and some related concepts. It will be
argued that the hypothesis-testing models have little to do with the
actual testing of hypotheses and that the use of the models has encour-
aged some unhealthy attitudes toward research. Some alternative
approaches will be suggested.

Few, if any, of the criticisms which will be made were originated by
the author, and, taken separately, each is probably a well-known
“straw man.” However, it is hoped that when the criticisms are brought
together they will argue persuasively for a change in viewpoint about
statistical logic in psychology.


What is wrong? 


Most will agree that science is mainly concerned with finding functional 
relations. A particular functional relationship may be studied either 
because it is interesting in its own right or because it helps clarify a 
theory. The functional relations most often sought in psychology are
correlations between psychological variables, and differences in central
tendency in differently treated groups of subjects. Saying it in a simpler
manner, psychological results are usually reported as correlation
coefficients (or some extension thereof, such as factor analysis) and
differences between means (or some elaboration, such as a complex 
analysis of variance treatment). 


Hypothesis testing

After an experiment is completed, and the correlations or differences
between means have been obtained, the results must be interpreted.
The experimenter is aware of sampling error and realizes that if the
experiment is run on different groups of subjects the obtained relations
will probably not be the same. How then should he take into account
the chance element in the obtained relationship? In order to interpret
the results, the experimenter would, as most of us have, rely on the
statistical models for hypothesis testing. It will be argued that the
hypothesis-testing models are inappropriate for nearly all psychological
studies.

Statistical hypothesis testing is a decision theory: you have one
or more alternative courses of action, and the theory leads to the
choice of one or several of these over the others. Although the theory
is very useful in some practical circumstances (such as in “quality
control”), it is misnamed. It has very little to do with hypothesis
testing in the way that hypotheses are tested in the work-a-day world
of scientific activity.

The most misused and misconceived hypothesis-testing model
employed in psychology is referred to as the “null-hypothesis” model.
Stating it crudely, one null hypothesis would be that two treatments do
not produce different mean effects in the long run. Using the obtained
means and sample estimates of “population” variances, probability
statements can be made about the acceptance or rejection of the null
hypothesis. Similar null hypotheses are applied to correlations,
complex experimental designs, factor-analytic results, and most all
experimental results.




Although from a mathematical point of view the null-hypothesis
models are internally neat, they share a crippling flaw: in the real
world the null hypothesis is almost never true, and it is usually non-
sensical to perform an experiment with the sole aim of rejecting the
null hypothesis. This is a personal point of view, and it cannot be
proved directly. However, it is supported both by common sense
and by practical experience. The common-sense argument is that
different psychological treatments will almost always (in the long
run) produce differences in mean effects, even though the differences
may be very small. Also, just as nature abhors a vacuum, it probably
abhors zero correlations between variables.

Experience shows that when large numbers of subjects are used
in studies, nearly all comparisons of means are “significantly” different
and all correlations are “significantly” different from zero. The author
once had occasion to use 700 subjects in a study of public opinion.
After a factor analysis of the results, the factors were correlated
with individual-difference variables such as amount of education, age,
income, sex, and others. In looking at the results I was happy to find
so many “significant” correlations (under the null-hypothesis model);
indeed, nearly all correlations were significant, including ones that
made little sense. Of course, with an N of 700 correlations as large
as .08 are “beyond the .05 level.” Many of the “significant” correlations
were of no theoretical or practical importance.
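
The arithmetic behind the N of 700 example can be checked with a short sketch. It is only an illustration: the function name `correlation_test` is hypothetical, and the normal critical value stands in for the exact t distribution, which is a large-sample approximation assumed here rather than anything stated in the text.

```python
import math
from statistics import NormalDist


def correlation_test(r: float, n: int, alpha: float = 0.05) -> dict:
    """Two-tailed test of the null hypothesis that the true correlation is zero.

    Uses the usual t statistic for a correlation, compared against a normal
    critical value (an approximation that is reasonable for large N).
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return {
        "t": t,
        "significant": abs(t) > critical,
        # Proportion of variance accounted for by the relationship.
        "variance_explained": r * r,
    }


# The same tiny correlation of .08, with a modest N and with N = 700.
for n in (100, 700):
    result = correlation_test(0.08, n)
    print(n, round(result["t"], 2), result["significant"],
          round(result["variance_explained"], 4))
```

With these assumptions a correlation of .08 passes the .05 level at N = 700 but not at N = 100, while in both cases it accounts for well under one percent of the variance.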

The point of view taken here is that if the null hypothesis is not
rejected, it usually is because the N is too small. If enough data is
gathered, the hypothesis will generally be rejected. If rejection of the
null hypothesis were the real intention in psychological experiments,
there usually would be no need to gather data.

The arguments above apply most straightforwardly to “two-tail
tests,” which are used in most experiments. A somewhat better argument
can be made for using the null hypothesis in the one-tail test.
However, even in that case, if rejection of the null hypothesis is not
obtained for the specified direction, the hypothesis can be reversed and
rejection will usually occur.

Perhaps my intuitions are wrong—perhaps there are many cases
in which different treatments produce the same effects and many cases
in which correlations are exactly zero. Even so, the emphasis on the
null-hypothesis models is unfortunate. As is well recognized, the mere
rejection of a null hypothesis provides only meager information.
For example, to say that a correlation is “significantly” different from
zero provides almost no information about the relationship. Some
would argue that finding “significance” is only the first step, but how
many psychologists ever go beyond this first step?

Psychologists are usually not interested in finding tiny relationships.
However, once this is admitted, it forces either a modification
or an abandonment of the null-hypothesis model.

An alternative to the null hypothesis is the “fixed-increment” 
hypothesis. In this model, the experimenter must state in advance 
how much of a difference is an important difference. The model 
could be used, for example, to test the differential effect of two methods 
of teaching psychology, in which an achievement test is used to measure 
the amount of learning. Suppose that the regular method of instruction 
obtains a mean achievement test score of 45. In the alternative method 
of instruction, laboratory sessions are used in addition to lectures. 
The experimenter states that he will consider the alternative method of 
instruction better if, in the long run, it produces a mean achievement 
test score which is at least ten points greater than the regular method of 
instruction. Suppose that the alternative method actually produces a
mean achievement test score of 65. The probability can then be
determined as to whether the range of scores from 55 upwards covers
the “true” value (the parameter).

The difficulty with the “fixed-increment” hypothesis-testing model 
is that there are very few experiments in which the increment can be 
stated in advance. In the example above, if the desired statistical
confidence could not be found for a ten point increment, the experi- 
menter would probably try a nine point increment, then an eight point 
increment, and so on. Then the experimenter is no longer operating 
with a hypothesis-testing model. He has switched to a confidence- 
interval model, which will be discussed later in the article. 


The small N fallacy 
Closely related to the null hypothesis is the notion that only enough 
subjects need be used in psychological experiments to obtain 
“significant” results. This often encourages experimenters to be 
content with very imprecise estimates of effects. In those situations 
where the dispersions of responses are small, only a small number of 
subjects is required. However, such situations are seldom encountered 
in psychology. The question, “When is the N large enough?” will be 
discussed later in the article. 


Even if the object in experimental studies were to test the null
hypothesis, the statistical test is often compromised by the small N.
The tests depend on assumptions like homogeneity of variance, and
the small N study is not sufficient to say how well the assumptions
hold. The small N experiment, coupled with the null hypothesis, is
usually an illogical effort to leap beyond the confines of limited data
to document lawful relations in human behavior.


The sampling fallacy


In psychological experiments we speak of the group of subjects as a
“sample” and use statistical sampling theory to assess the results. Of
course, we are seldom interested only in the particular group of subjects,
and it is reasonable to question the generality of the results in wider
collections of people. However, we should not take the sampling notion
too seriously, because in many studies no sampling is done. In many
studies we are content to use any humans available. College freshmen
are preferred, but in a pinch we will use our wives, secretaries, janitors,
and anyone else who will participate. We should then be a bit cautious
in applying a statistical sampling theory, which holds only when
individuals are randomly or systematically drawn from a defined
population.


The crucial experiment 


Related to the misconceptions above are some misconceptions about
crucial experiments. Before the points are argued, a distinction should
be made between crucial designs and crucial sets of data. A crucial
design is an agreed-on experimental procedure for testing a theoretical
[...]

Although crucial designs have played important parts in some
areas of science, few [...] In psychology it is more often the case that
experimenters propose different designs for [...] Experimental designs
that apparently differ in small ways often produce
different relationships. However, this is not a serious bother.
Antithetical results should lead to more comprehensive theory.

A more serious concern is whether particular sets of experimental
data can be regarded as crucial. Even when different psychologists
employ the same design they often obtain different relationships.
Such inconsistencies are often explained by “sampling error,” but
this is not a complete explanation. Even when the N’s are large,
it is sometimes reported that Jones finds a positive correlation, Smith
a negative correlation, and Brown a nil correlation. The results of
psychological studies are sometimes particular to the experimenter
and the time and place of the experiment. This is why most
psychologists would place more faith in the results of two studies, each
with 50 subjects, performed by different investigators in different
places, than in the results obtained by one investigator for 100 subjects.
Then we must be concerned not only with the sampling of people but
with the sampling of experimental environments as well. The need to
“sample” experimental environments is much greater in some types
of studies than in others. For example, the need probably would be
greater in group dynamic studies than in studies of depth perception.


What should be done? 


Estimation 


Hypotheses are really tested by a process of estimation rather than with
statistical hypothesis-testing models. That is, the experimenter wants
to determine what the mean differences are, how large the correlation is,
what form the curve takes, and what kinds of factors occur in test
scores. If, in the long run, substantial differences are found between
effects or if substantial correlations are found, the experimenter can
then speak of the theoretical and practical implications.

To illustrate our dependence on estimation, analysis of variance 
should be considered primarily an estimation device. The variances 
and ratios of variances obtained from the analysis are unbiased 
estimates of different effects and their interactions. The proper 
questions to ask are, “How large are the separate variances?” and 
“How much of the total variance is explained by particular classifica- 
tions?” Only as a minor question should we ask whether or not the 
separate sources of variance are such as to reject the null hypothesis. 
Of course, if the results fail to reject the null hypothesis, they should 
not be interpreted further; but if the hypothesis is rejected, this 
should be considered only the beginning of the analysis. 

Once it is realized that the basis for testing psychological hypotheses
is that of estimation, other issues are clarified. For example,
the Gordian knot can be cut on the controversial issue of “proving”
the null hypothesis. If, in the long run, it is found that the means of
two differently treated groups differ inconsequentially, there is nothing
wrong with believing the results as they stand.


Confidence intervals 


It is not always necessary to use a large N, and there are ways of telling
when enough data has been gathered to have faith in statistical
estimates. Most of the statistics which are used (means, variances,
correlations, and others) have known distributions, and, from these,
confidence intervals can be derived for particular estimates. For
example, if the estimate of a correlation is .50, a confidence interval
can be set for the inclusion of the “true” value. It might be found in this
way that the probability is .99 that the “true” value¹ is at least as high
as .30. This would supply a great deal more information than to reject
the null hypothesis only.
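
The kind of statement made above (an estimate of .50, with the “true” value very probably at least .30) can be sketched as a one-sided confidence bound. The sketch assumes the Fisher z transformation, which is a standard method but is not named in the text. The function name `correlation_lower_bound` and the sample sizes tried are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_lower_bound(r: float, n: int, confidence: float = 0.99) -> float:
    """One-sided lower confidence bound for a correlation.

    Assumes the Fisher z transformation: atanh(r) is treated as roughly
    normal with standard error 1 / sqrt(n - 3).
    """
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    z_critical = NormalDist().inv_cdf(confidence)
    return math.tanh(z - z_critical * standard_error)


# How the lower bound for an observed correlation of .50 moves with N.
for n in (50, 100, 400):
    print(n, round(correlation_lower_bound(0.50, n), 3))
```

Under these assumptions the bound rises toward .50 as N grows, which is the sense in which confidence methods show how much more data is needed for a given precision.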

The statistical hypothesis-testing models differ in a subtle, but 
important, way from the confidence methods. The former make
decisions for the experimenter on an all-or-none basis. The latter tell 
the experimenter how much faith he can place in his estimates, and 
they indicate how much the N needs to be increased to raise the preci- 
sion of estimates by particular amounts. 

The null-hypothesis model occurs as a special case of the confidence 
models. If, for example, in a correlational study the confidence 
interval covers zero, then, in effect, the null hypothesis is not rejected. 
When this occurs it usually means that not enough data has been 
gathered to answer the questions at issue. 
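
The special-case relation described above can be illustrated with a two-sided interval: when the interval covers zero, the null hypothesis is in effect not rejected. This is a sketch under the assumption of the Fisher z approximation. The name `correlation_interval` and the values of N are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_interval(r: float, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Two-sided confidence interval for a correlation (Fisher z approximation)."""
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + confidence / 2.0) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# An observed correlation of .08 at two sample sizes.
for n in (100, 700):
    low, high = correlation_interval(0.08, n)
    covers_zero = low <= 0.0 <= high
    print(f"N={n}: interval=({low:.3f}, {high:.3f}), covers zero: {covers_zero}")
```

The interval carries the all-or-none verdict (zero inside or outside) and also shows how narrow or wide the range of plausible values is.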


Discriminatory power 


In conjunction with making estimates and using confidence methods 
with those estimates, methods are needed for demonstrating the 
strength of relationships. In correlational studies, this need is served 
by the correlations themselves. In measuring differences in central
tendency for differently treated groups, no strength-of-relationship
measure is generally used [...] would serve the purpose). The
dichotomous “group” [...] correlated with the dependent variable. When
the N [...] relationship measure that can be [...] mean differences. The
statistic is [...]

Epsilon is an unbiased estimate of the correlation ratio, Eta. It
is unbiased because “degrees of freedom” are employed in the variance
estimates. To show how Epsilon is applied, consider the one-
classification analysis of variance results shown in Table 1.

1. Technically, it would be more correct to say that the probability is .99
that the range from .30 to 1.00 covers the parameter.

Epsilon is obtained by dividing the error variance (in the example
in Table 1, the within columns variance) by the total variance, sub- 
tracting that from one, and taking the square-root of the result. 
The one classification in Table 1 explains 49 percent of the total 
variance, which shows that the classification has high discriminatory 
power. Of course, in this case, the null hypothesis would have been 
rejected, but that is not nearly as important as it is to show that the 
classification produces strong differences. 


Table 1
Hypothetical results illustrating the use of epsilon

Source                      Sums of squares     df    Variance est.
Experimental treatments           510            4       127.50
  (between column means)
Within columns                    490          119         4.12
Total                            1000          123         8.13

$$(\text{Epsilon})^2 = 1 - \frac{\text{Within var.}}{\text{Total var.}} = 1 - \frac{4.12}{8.13} = .49$$

$$\text{Epsilon} = .70$$
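
The computation in Table 1 can be reproduced with a few lines of code. This is a minimal sketch: the function name `epsilon_from_anova` is hypothetical, and flooring the squared value at zero (for cases where the within variance exceeds the total variance) is an added assumption, not part of the original procedure.

```python
import math


def epsilon_from_anova(ss_between: float, df_between: int,
                       ss_within: float, df_within: int) -> tuple[float, float]:
    """Return (epsilon squared, epsilon) for a one-classification analysis of variance."""
    within_var = ss_within / df_within
    total_var = (ss_between + ss_within) / (df_between + df_within)
    # One minus the ratio of error variance to total variance.
    # Floored at zero so that the square root is always defined (assumption).
    epsilon_squared = max(0.0, 1.0 - within_var / total_var)
    return epsilon_squared, math.sqrt(epsilon_squared)


# Sums of squares and degrees of freedom taken from Table 1.
eps_sq, eps = epsilon_from_anova(510, 4, 490, 119)
print(round(eps_sq, 2), round(eps, 2))
```

Because both variances are divided by their degrees of freedom, the result (about .49) is slightly smaller than the raw ratio of sums of squares, 510/1000.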


Whereas Epsilon was applied in Table 1 to the simplest analysis
of variance design, it applies equally well to complex designs. Each
classification produces an Epsilon, which shows directly the
discriminatory power of each (see Peters and Van Voorhis, 1940).

Epsilon is simply a general measure of correlation. If levels
within a classification are ordered on a quantitative scale and
regressions are linear, Epsilon reduces to the familiar r.


A point of view 

Statisticians are not to blame for the misconceptions in psychology 
about the use of statistical methods. They have warned us about the 
use of the hypothesis-testing models and the related concepts. In 
particular they have criticized the null-hypothesis model and have
recommended alternative procedures similar to those recommended
here. (See Savage, 1957; Tukey, 1954; and Yates, 1951.)

People are complicated, and it is hard to find principles of human 
behavior. Consequently, psychological research is often difficult and 
frustrating, and the frustration can lead to a “flight into statistics.”
With some, this takes the form of a preoccupation with statistics to the 
point of divorcement from the headaches of empirical study. With 
others, the hypothesis-testing models provide a quick and easy way of 
finding “significant differences” and an attendant sense of satisfaction. 

The emphasis that has been placed on the null hypothesis and its 
companion concepts is probably due in part to the professional milieu 
of psychologists. The “reprint race” in our universities induces us to 
publish hastily-done, small studies and to be content with inexact 
estimates of relationships. 


There is a definite place for small N studies in psychology. A
chain of small studies, each elaborating and modifying the hypotheses
and procedures, can eventually lead to a good understanding of a
domain of behavior. However, if such small studies are taken out of
context and considered (or published) separately, they usually are of
little value, even if null hypotheses are successfully rejected.

Psychology had a proud beginning, and it would be a pity to see
it settle for the meager efforts which are encouraged by the use of the
hypothesis-testing models. The original purpose was to find lawful
relations in human behavior. We should not feel proud when we see the
psychologist smile and say “the correlation is significant beyond the
.01 level.” Perhaps that is the most that he can say, but he has no
reason to smile.
Conclusions vs. decisions*


John W. Tukey


With the exception of appendices 2 and 3, the following is based on the after
dinner talk given by Professor John W. Tukey at the first meeting of the Section of
the Physical and Engineering Sciences of the American Statistical Association
held in New York City on May 26, 1955. This talk was repeated at a later date
before a dinner meeting of the Metropolitan Section of the American Society for
Quality Control. On both occasions considerable discussion ensued. The talk is
published here both for the record, and in the hope that some readers may be
stimulated to prepare written rejoinders.


Introduction 


My subject tonight should be both interesting and professionally
relevant and yet should not involve formulas or a blackboard. Of the
topics most professionally relevant to statisticians, I must choose
between human relations, as between statistician and client, and
statistical philosophy, both subjects where our practices often outrun
our formal philosophy, both subjects where more discussion and
understanding are needed if our practices are to improve as fast as they
should.

It is especially important that our discussion and understanding
of statistical philosophy be firm and well-balanced. For one-sided
development, no matter how important the single aspect may be, will
ultimately deflect some, if not all, of our practices into unwise bypaths.

* From Technometrics, Vol. 2 (No. 4), November, 1960. Reprinted by permission
of the author and the American Statistical Association.
Prepared in part in connection with research sponsored by the Office of
Naval Research.

I have been concerned for a number of years with the tendency of
decision theory to attempt the conquest of all statistics. This concern
has been founded, in large part, upon my belief that science does not
live by decisions alone—that its main support is a different sort of
inference.

Effective discussion of this problem, and a real start toward the
development of a consensus of opinion, has been retarded by the
absence of a word for this other sort of inference, a word which could
be contrasted with “decisions.” For me, there is now a word. (Some
dislike it, but no one has suggested a better choice.) The word is
“conclusions.” Conclusion theory is intended, not to replace decision
theory, but to stand firm beside it.

Because I believe that conclusions are even more important to
science than decisions, it is particularly appropriate that I am able to
speak to the first meeting of the ASA’s new Section on Physical and
Engineering Sciences about the relations, and the differences, between
decisions and conclusions. I know of no better way to wish the Section
well than to encourage its membership to thought and discussion on a
topic which I believe will remain important to the [...] of the
functions of all of its members.


Decisions, what are they? 


Some of us have read about decision theory, most of us have heard of it,
and all of us make decisions. But do we have a clear idea of what a
decision-theorist’s decision is? Have the books made the essential
points clear? Or have they discussed only the externals of a single
[...]? In fact, there has been so little discussion of essentials
that I have had to formulate my own idea of what a “decision”, in the
sense of modern decision theory, really is.

The decisions of practice are far more nearly of the form “let us
decide to act for the present as if” than of the form long conventional
in [...] of decision theory—“we accept.” The distinction is
important and too often neglected. The restrictions “act . . . as if” and
“for the present” convey two separate and important ideas, ideas
which serve to distinguish conclusions from decisions, ideas which
epitomize much of what I wish to say.

When an engineer must choose at once between two ways of
building a bridge, or a doctor must choose which of two treatments to
apply to a patient who is critically ill, or when a businessman must
choose between two policies for the season that is now upon him,
each must weigh alternative A against alternative B in this immediate
situation, and strive to select the alternative that will yield the bigger
reward, whether this reward be a cheaper safe bridge, a better chance
of recovery for the patient, or a more profitable season. The possible 
actions are defined, their consequences in various “states of nature” 
are understood, and some evidence about these states of nature is at 
hand. In each instance the individual must judge whether to act as if 
the reward from alternative A will indeed prove to be greater than that 
from alternative B (which we may abbreviate “A > B”), or whether 
the opposite is true (“A < B”). 
The three alternative decisions:


1. to act in the present situation as if A > B,
2. to act in the present situation as if A = B,
3. to act in the present situation as if A < B


seem to me reasonably stated, while the conventional statements of
the alternatives:


1′. to accept A > B,
2′. to accept A = B,
3′. to accept A < B


seem to have been (unconsciously) well calculated to mislead the reader
or student.

When we say “act as if A > B,” we have made no judgment as to
the “truth” or “certainty beyond a reasonable doubt” of the statement
“A > B.” When we say “for the present,” we are referring only to the
particular situation under consideration at present. Thus what we
have done is to weigh both the evidence concerning the relative
merits of A and B and also the probable consequences in the present
situation of various actions (actions, not decisions!). Finally, we have
decided that the particular course of action which would be appropriate
if A were truly > B is the most reasonable one to adopt in the specific
situation that faces us.

When we say “act as if A > B” and “in the present situation,”
we assert no judgment as to the “truth” or “certainty beyond a
reasonable doubt” of the statement “A > B,” and we make no judgment about
the wisdom of choosing among actions in all, or even many, of the
situations in which a knowledge that A was truly > B would determine
a wise man’s choice. The consequences in other situations of acting
as if A > B have not been considered. It is important that we have
not done these things; it is perhaps even more important that we know
that we have not done them.

What has been done is simple and specific. The evidence concerning
the relative rewards from the alternatives has been weighed; the
consequences in the present situation of various actions (not decisions!)
have been assessed. We have decided that, in this single specific
situation, the particular action that would be appropriate if A were
truly > B is the most reasonable action to take.

Two sorts of special cases may help to tie down these remarks: 
It is often necessary to make a decision on the basis of no formal data 
at all. (Consider the hen crossing the road!) It may be reasonable to 
make two opposite decisions at the same time with regard to different 
actions. (How many of us both save for our own future and carry life 
insurance, perhaps even in a single policy? One is a decision to act as 
if we will live, the other a decision to act as if we will die!) 

Decisions to “act for the present as if” are attempts to do as well
as possible in specific situations, to choose wisely among the available
gambles.


Conclusions, what may they be?


Like any other human endeavor, science involves many decisions,
but it progresses by the building up of a fairly well established body of
knowledge. (One whose relevance is supposed to be broad.) This body
grows by the reaching of conclusions—by acts whose essential
characteristics differ widely from the making of decisions. Conclusions
are established with careful regard to evidence, but without regard
to consequences of specific actions in specific circumstances. (They are,
of course, based on specific experiments or observations.) Conclusions
are withheld until adequate evidence has accumulated.

A conclusion is a statement which is to be accepted as applicable
to the conditions of an experiment or observation unless and until
unusually strong evidence to the contrary arises. This definition has
three crucial parts; two explicit, and the third implicit. It emphasizes
“acceptance”, in the original, strong sense of that word; it speaks of
“unusually strong evidence”; and it implies the possibility of later
rejection.

First, the conclusion is to be accepted. It is taken into the body of
knowledge, not just into the guidebook of advice for immediate action,
as would be the case with a decision. It is something of lasting value
extracted from the data.

Indeed, the conclusion is to remain accepted, unless and until
unusually strong evidence to the contrary arises. This implies that only
a small percentage of all conclusions will, in due course, be upset.

Third, a conclusion is accepted subject to future rejection, when
and if the evidence against it becomes strong enough. (Only a small 
proportion of conclusions will be rejected.) It is taken to be of lasting 
value, but not necessarily of everlasting value. 

These characteristics are very different from those of a decision- 
theorist’s decision. The differences are extremely important. 

It has been wisely said that “science is the use of alternative 
working hypotheses.” Wise scientists use great care and skill in 
selecting the bundle of alternative working hypotheses they use. 
Conclusions typically reduce the spread of the bundle of those working 
hypotheses which are regarded as still consistent with the observations. 
Hence conclusions must be reached cautiously, firmly, not too soon 
and not too late. And they must be judged by their long run effects, 
by their “truth,” not by specific consequences of specific actions. 


Statistical vs. experimenter’s conclusions


As statisticians we must insist upon more than one kind of conclusion,
upon the difference between “statistical conclusions” and
“experimenter’s conclusions.” A “statistical conclusion” applies to the actual
conditions of the experiment. If a consistent blunder were made, if the
instruments or measurements yield substantial systematic errors (they
will always have some systematic errors, though we may hope that
these are small), if the measurements were reduced according to a
theory which is incomplete in some important way (it will always be
incomplete to a certain extent), if the conditions or measurements were
incorrectly recorded, if the importance of important variables were
not recognized (so that their values were not recorded or reported), the
stated conclusions are likely to be wrong. Errors for such reasons
are not to be charged against statistical conclusions.

But experimenter’s conclusions, be they physical conclusions,
chemical conclusions, biological conclusions or engineering
conclusions, must take account of all these possibilities. In most areas of
experiment or observation it will be either desirable or necessary for
the experimenter to make specific allowance, beyond the statistically
recognizable uncertainty, for such deviations of the actual situation
from the supposed situation. For this reason, his conclusions will be
weaker than the statistical ones.

This difference, which arises from what may loosely be called the
problem of systematic error, is an important challenge to the statistician.
Both the statistician’s morale and his integrity are tested when, for
example, he has to face the possibility of a really substantial systematic
error just after he has used all his skill to reduce, in the same experiment,
the effects of fluctuating errors to 95% of their former value. It
challenges his relationship to his clients in two opposite ways. When
his client is quantitatively sophisticated, as many physical and
engineering scientists are, he must face the systematic errors or lose his
client’s respect. When his client is not quantitatively sophisticated, as
is often the case in other fields, he must educate the client at the proper
rate, not too rapidly and not too slowly—first, perhaps about
fluctuating errors, but eventually about systematic errors, too!


Asymmetry can be essential 


We have emphasized the most important differences between decisions 
and conclusions. There is another difference which is not quite among 
the most important, but which yet deserves a place of its own. This is 
the treatment of doing nothing.

In most accounts of decision theory, the decision to do nothing is
either ignored (which is probably the worst thing to do in practice) or
treated on a par with all the other decisions. In conclusion theory, on
the other hand, not coming to a conclusion plays a very special role.
Three instances may help us to reflect on this distinction:


1. All of us who were originally brought up in physical or biological
science feel quite clearly, I am sure, that “to be not yet certain” is very
different from other attitudes about a question.

2. We may be surprised to find a related attitude among
administrators—Chester I. Barnard, on page 194 of The Functions of the
Executive [1] says (his italics) “The fine art of executive decision
consists in not deciding questions that are now not pertinent, in not deciding
prematurely, in not making decisions that cannot be made effective, and
in not making decisions that others should make.”

3. An active worker in decision theory told me recently that the
decision to do nothing was “the only decision without a loss function.”


Each of these emphasizes, in a different way, the distinctive [...]
of “doing nothing.” Each deserves further examination. [Appendix [...]
will treat (1).]

Barnard’s statement implies that the “decisions” of the executive
are much more nearly what we have called conclusions than what we
have called decisions. They are not to be entered upon lightly, and
there is a clear implication that, once reached, they are to be referred to
for some time as part of a growing body of doctrine.

The decision-theorist’s statement in (3) reveals him, it seems to me,
as one who is really in search of conclusions. Why else is “doing 
nothing” so different? It is an action, one that can, in particular, lose 
money. 

Decision theory ought to be symmetrical with regard to the action 
“do nothing.” Conclusion theory must be unsymmetrical with regard 
to the action “conclude nothing.” 


Tests of significance 


The prototype of modern experimental statistics was the test of 
significant difference. It came first as a tool of analysis and inference, 
not as a tool of mere description. When we examine its purport in 
the framework we are describing, we find that it is a qualitative con- 
clusion procedure. Its purpose is to answer the question “Dare we 
conclude that this difference is not zero?” 

We may, on the basis of a test of significance, conclude that A ≠ B,
or even more specifically that A < B or A > B. But failure to attain
significance is not, of itself, intended to produce a conclusion, is not 
intended to be accepted, in that strong sense of the word “accept” 
which is relevant to conclusions. 

Where do we stand when the difference between A and B has not 
reached “significance?” Some would like to wield Occam’s razor and 
say that “We have shown that A = B.” Surely we have not concluded 
that A = B. For no quantitative evidence can establish that A is 
not just a very little different from B. Perhaps we have decided that 
A = B, but if so, for what specific situation, on what evidence, and 
with what assessment of consequences? 

To interpret appropriately a failure to attain significance, it is 
necessary to know something about the precision of the comparison, to 
know how close there is reason to believe A is to B. Only by advancing 
into the use of confidence techniques (about which more anon) can a 
negative statement about significance be converted into a positive
conclusion of established smallness of difference.
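
As a minimal sketch of that conversion, assuming hypothetical numbers that do not come from the text (an observed difference, its standard error, and a tabled two-sided critical value), a confidence interval turns "not significant" into a bound on how far A may be from B:

```python
def confidence_interval(diff, std_error, critical_value):
    """Two-sided interval: diff +/- critical_value * std_error."""
    half_width = critical_value * std_error
    return diff - half_width, diff + half_width


def largest_plausible_difference(interval):
    """Largest absolute difference still inside the interval."""
    low, high = interval
    return max(abs(low), abs(high))


# Hypothetical inputs chosen only for illustration.
observed_diff = 0.4      # estimate of A - B
standard_error = 0.5     # standard error of that estimate
critical_value = 2.0     # assumed tabled value for roughly 95% confidence

interval = confidence_interval(observed_diff, standard_error, critical_value)
contains_zero = interval[0] < 0.0 < interval[1]
bound = largest_plausible_difference(interval)
print(interval, contains_zero, bound)
```

The test alone reports only that zero lies inside the interval. The interval adds the positive statement that the difference is unlikely to exceed the bound; whether that bound counts as "small" depends on the problem at hand.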


Tests of hypothesis 

Symmetry, and mathematical simplicity, seemed to lead along a
straight path from tests of significance to tests of hypothesis. As the
procession traversed this path, few if any stopped to see where they had
gone, to notice that they had left a qualitative conclusion procedure
and had come to what was suspiciously like a qualitative decision
procedure.

The choice between two simple hypotheses can be viewed in two 
quite different ways: 


1. as an attempt to choose the best risk, without regard to certainty,
which is surely a decision procedure, or


2. as an attempt to control, often by a sequential procedure, both
kinds of error (both the error of accepting the hypothesis when it is
false, and the error of rejecting it when it is true) at suitably low levels,
which is, on the face of it, a conclusion procedure.


The aim of (2) can be expressed as follows: “We will take enough 
observations to allow us to dare to conclude either that the first 
hypothesis is false, or that the second hypothesis is false, but we shall 
not try to conclude that both are false, even if the observations prove 
adequate to do this.” The form of this statement is clearly that of a
conclusion procedure, though it is natural to wonder at the presence
of its last proviso.
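
One well-known procedure of the kind described in (2) is Wald's sequential probability ratio test. The text does not name it, so the sketch below is an illustration under that assumption. The function name, the Bernoulli setting, and the values of `p0`, `p1`, `alpha` and `beta` are hypothetical. Observations are taken one at a time until the accumulated log likelihood ratio crosses one of two bounds fixed by the two tolerated error rates (Wald's approximate bounds).

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha, beta):
    """Sequential test of H0: p = p0 against H1: p = p1 (with p1 > p0).

    alpha: tolerated chance of rejecting H0 when it is true.
    beta:  tolerated chance of rejecting H1 when it is true.
    Returns (verdict, number of observations used).
    """
    upper = math.log((1 - beta) / alpha)   # cross it: conclude H0 is false
    lower = math.log(beta / (1 - alpha))   # cross it: conclude H1 is false
    log_ratio = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= upper:
            return "conclude H0 is false", n
        if log_ratio <= lower:
            return "conclude H1 is false", n
    return "no conclusion yet", n


# Hypothetical run: four successes in a row are enough to stop.
print(sprt_bernoulli([1, 1, 1, 1, 1], p0=0.3, p1=0.7, alpha=0.05, beta=0.05))
```

The sketch has exactly three outcomes, which mirrors the proviso in the text: it can conclude that one hypothesis or the other is false, or withhold judgment, but it never concludes that both are false.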


[...] really (1), to choose the best risk,
then there is no real place in the procedure for the artificial limitations
of 5%, of 1% or of any of the conventional significance or confidence
[...]ing is to be concluded, only something decided, there is
[...] of error. (Only the mathematical
[...]ositive to make a small gamble
[...]


Probably the greatest ultimate importance, among all types of
statistical procedures we now know, belongs to confidence procedures
which, by making interval estimates, attempt to reach as strong
conclusions as are reasonable by pointing out, not single likely values,
but rather whole classes (intervals, regions, etc.) of possible values, so
chosen that there can be high confidence that the “true” value is
somewhere among them. Such procedures are clearly quantitative
conclusion procedures. They make clear the essential “smudginess”
of experimental knowledge.


The twin dichotomies 

Keeping the varied sorts of statistical inference procedures separate, and 
yet properly related to one another, is important to every statistician. 
Hopefully, the distinction between decisions and conclusions, as well 
as the distinction between qualitative and quantitative, are now clear. 

The writer has found, and continues to find, these twin dichotomies
(qualitative-quantitative and conclusion-decision) most helpful in
organizing the procedures of statistics into a pattern which is useful
both for application and reflection.

Surely the quantitative is preferable to the qualitative whenever
both are equally available and equally relevant. Thus most qualitative
statistical procedures are interim measures, introduced to serve until
equally relevant quantitative procedures become available.

If we use the phrases “to do one’s best” and “to state only that which
is certain” as typifying decisions on the one hand, and conclusions on
the other, we can see that there is a real place for both. And in particular
situations we can usually tell what these places are.

To sum things up: The case of qualitative vs. quantitative should
have a mixed verdict, granting “qualitative” squatters rights, but only
until “quantitative” is ready to move in; while the case of conclusions
vs. decisions should be settled out of court, with an understanding
that cooperation is vital to both parties. There is a place for both
“doing one’s best” and “saying only what is certain,” but it is important
to know, in each instance, both which one is being done, and which one
ought to be done.


Some confusing relations


So long as we lacked clearly contrasted words, confusion between
decisions and conclusions was very easy, especially since they are
thoroughly combined, both so frequently, and in almost every possible
[...] decisions and conclusions are required in almost every field
of human endeavor, yet the proportions, mutual relations and relative
dominance which are appropriate vary greatly from one field to
another. The aim and purpose of pure science lies in the conclusions
which build up knowledge. Yet these conclusions are reached because
individual scientists decide to attack certain problems in certain ways.
(They rarely, if ever, know enough to conclude which problems they
should attack, or how.) In most fields of engineering much must depend
on the wisdom of experience, on engineering judgment, on engineering
decisions. Yet these decisions are built upon the conclusions of pure
and applied science. Engineering uses decisions fortified with
conclusions, just as science uses decisions to reach conclusions.

In statistics, too, conclusions and decisions are interrelated and
intertwined. It is not infrequent that we come to conclusions about
decision procedures. What may prove to be one of the greatest
monuments to Abraham Wald’s memory is the notion of admissibility.
And one of its more important elements is the fact that we may conclude
(in this instance purely from theory and presuppositions) that one
decision procedure is always worse than another.

We have seen that point estimates may reasonably be regarded as
decisions. If we have a situation in which alternative point estimates
are investigated by experimental sampling, and if the sampling is
continued until the effects of sampling fluctuations fall below a
prechosen standard of smallness, we are really experimenting and
can reach a conclusion about competing decision procedures. A third
instance, closer in feeling to the first, is provided by R. A. Fisher’s
classical paper of 1920, “A Mathematical Examination of the Methods
of Determining the Accuracy of an Observation by the Mean Error
and by the Mean Square Error” [2], one of many objective comparisons
of estimators.


On the other hand, all of us make decisions about conclusion
procedures. Some of us do it every day. “How is it best to analyze [...]
data?” is a question which cannot be left to the experimenter [...]
which the statistician is bound by his profession to try to answer.
If the answer should clearly be a procedure to provide a conclusion, then he
must do something about a conclusion procedure. Does he decide 
about it, or conclude about it? As an inherent conservative (profes- 
sionally, anyway!), he would like to conclude. But will he have enough 
firm evidence? Often he will not! 

When a transformation is chosen, whether for an analysis of
variance, for a quantal response assay, or for some other statistical
procedure, how often does the chooser know what is the best
transformation? In a theoretical sense, the answer is “never,” for he will have
only a finite amount of information—since his estimate of “best” 
will have a finite standard deviation—and transformations can be 
varied in arbitrarily small steps. In practice, it must be recognized that 
exactly the best transformation is not required, so that such an argument 
is not compelling. Yet, even in a practical sense, the answer is “not
nearly often enough,” for adequate information is often, or even
usually, lacking. Who knows of an instance, to take a concrete example,
where the choice between probits and logits for a quantal response
assay was a conclusion and not a decision?

In handling complex data by analysis of variance, how shall we
set up the analysis? How detailed shall be our computations? On what
orthogonal functions shall we calculate regressions? Can any of you
recall situations where the answer to any of these was a conclusion?


APPENDIX 2 


Conclusion theory 
as an action system 
Insofar as man’s organized activities can be regarded as directed
toward at least dimly recognized goals, it is easy to [...]
individual actions which make up these activities [...] taken in
some appropriate form of decision theory. Actions are [...]
specific instances, and the gains or losses resulting [...]
combinations of actions and states of [...] be at
least roughly assessed. Why then should there be a place for conclusion
theory, which seems from such a broad viewpoint to be a poor substitute
for what is really needed? At least four classes of important reasons
loom up over the horizon: problems of communication, problems of
assessment of gain and loss, problems of assessment of the a priori,
problems of adequate mathematical treatment.

Most human affairs are not conducted by a single individual, nor
even by a single executive hierarchy. Science, in the broadest sense, is
both one of the most successful of human affairs, and [...]
decentralized. In principle, each of us puts his evidence (observations,
experimental or not, and their discussion) before all the others,
and in due course an adequate consensus of opinion develops. In the
early decades of the Royal Society of London, this was [...]
nearly how things were. But the number of working scientists has
doubled, and redoubled very many times since then. As a [...]
problems of communication have probably come to [...]
problems of scientific method. And the practices of science have
developed to meet the challenge. Outstanding among these practices
is the use of conclusions. A scientist is helped little to know that another,
given different evidence and facing a different specific [...]
decided (even decided wisely) to act as if so-and-so were the true state
of nature. The communication (for information, not as directives) of
decisions is often inappropriate, and usually inefficient. [...]
helped much to know that another reached a certain conclusion, that he
felt that the correctness of so-and-so was established with high
confidence. In order to replace conclusions as the basic means of
communication, it would be necessary [...]
of science. No statistician [...]
basis of his limited area of specialized knowledge.


But suppose a new fabric of science were to be developed. How 
could the old be compare 


that rapidity of progress i 
sake alone does not mea: 
science are being neglected in com 
Can one judge now how fa 


ange to a new fabric of science, one be 
more explicitly on decision-theoretic principles, how would the choice
among many such fabrics be made? There would be a need to [...]
something like an a priori state of the whole world, more precisely,
to choose an a priori distribution of probability over all the possible
states of the whole world, since just as the admissible decision
procedures are the Bayesian solutions (those solutions which are [...]
for suitable assumptions about the a priori probabilities of all “states of
nature” considered) so too the admissible decision fabrics are [...]
expected to be Bayesian fabrics. And it is a little too much to ask of
those who have learned to study certain limited aspects of the world,
and who are striving to learn a little more of these aspects, that they
envisage all possible worlds and distribute probability among them.
Finally, there are problems of adequate mathematical treatment. 
Statistics can solve a few vastly over-simplified problems in great 
generality or great detail, but it has barely begun to chew out a few 
little entrances into many problems of moderate difficulty. Problems 
of the order of difficulty of finding a Bayesian fabric, given the gains and 
losses, are wholly outside its present grasp. Today it has not provided 
even a beginning of an answer for such vastly simpler problems as:
given samples of moderate size from each of two populations, given that
the populations are so nearly normal (i.e., Gaussian) that samples of
1000 have no more than an even chance of detecting (at 5% significance)
that the populations are not normal, and (even) given that the
populations are symmetrical, what is the safest way to compare the centers of
the two populations on the basis of the samples, where safety combines
(1) reasonable reliability of significance or confidence percentages and
(2) avoidance of procedures which are relatively very wasteful for
particular population shapes? (Notes: (a) Even these many words, of
course, have not completely specified a problem. (b) Adding a
probability distribution over shapes to the hypotheses seems unlikely to
make the problem easier.)

There are four types of difficulty, then, ranging from communication
through assessment to mathematical treatment, each of which by
itself will be sufficient, for a long time, to prevent the replacement, in 
science, of the system of conclusions by a system based more closely on 
today’s decision theory. Once these four have been examined, the 
natural question becomes: “How did the conclusion system escape the 
parallel sets of difficulties?” The answer is simple and clear. It grew. 
This means that it evolved; that many minor alternatives were tried, 
often unconsciously, that most were found wanting, and were discarded; 
that this process of trial and selection went through cycle after cycle. 
The strength of the process of science today comes from experience
rather than insight, and this state of affairs may be expected to continue
for a long time. Indeed, it will not be easy to gain the limited insight
required to understand how the present processes of science do as well
as they do.


APPENDIX 3 


What of tests of hypothesis? 
In view of Neyman’s continued insistence on “inductive behavior,”
words which relate more naturally to decisions than to conclusions, it
is reasonable to suppose that the Neyman-Pearson theory of testing
hypotheses was, at the very least, a long step in the direction of decision
theory, and that the appearance of 5%, 1% and the like in its
development and discussion was a carryover from the then dominant qualitative
conclusion theory, the theory of tests of significance. If this view is
correct, Wald’s decision theory now does much more nearly what
tests of hypothesis were intended to do. Indeed, there are three ways in
which it does better. First, it has given up a fixed probability for errors
of the first kind, and has focussed on gains, losses or regrets (be they
average or minimax). Secondly, it has made it somewhat easier to
consider a much wider variety [...]
stringent assumptions. And finally it has shown that one should expect
mathematics to provide, not a [...]
assortment of good procedures [...]
[...] large samples, cf. [3].)
[...] theory. (And the Neyman-Pearson lemma can serve, in its proper
place, in both.)


The theory of probability and statistical inference is various things to [...]
calculus, to be explored and developed with little professional concern [...]
experimentalist, who has specialized along other lines, seldom feels
competent to extend criticisms or even comments; he is much more 
likely to make unquestioning application of procedures learned more 
or less by rote from persons assumed to be more knowledgeable of 
statistics than he. There is, of course, nothing surprising or repre- 
hensible about this—one need not understand the principles of a 
complicated tool in order to make effective use of it, and the research 
scientist can no more be expected to have sophistication in the theory of 
statistical inference than he can be held responsible for the principles of 
the computers, signal generators, timers, and other complex modern 
instruments to which he may have recourse during an experiment. 
Nonetheless, this leaves him particularly vulnerable to misinterpreta- 
tion of his aims by those who build his instruments, not to mention the 
ever present dangers of selecting an inappropriate or outmoded tool 
for the job at hand, misusing the proper tool, or improvising a tool of 
unknown adequacy to meet a problem not conforming to the simple 
theoretical situations in terms of which existent instruments have 
been analyzed. Further, since behaviors once exercised tend to 
crystallize into habits and eventually traditions, it should come as no 
surprise to find that the tribal rituals for data-processing passed 
along in graduate courses in experimental method should contain 
elements justified more by custom than by reason.

In this paper, I wish to examine a dogma of inferential procedure
which, for psychologists at least, has attained the status of a religious
conviction. The dogma to be scrutinized is the “null-hypothesis
significance test” orthodoxy that passing statistical judgment on a
scientific hypothesis by means of experimental observation is a
decision procedure wherein one rejects or accepts a null hypothesis
according to whether or not the value of a sample statistic yielded by
an experiment falls within a certain predetermined rejection region
of its possible values. The thesis to be advanced is that despite the
awesome pre-eminence this method has attained in our experimental
journals and textbooks of applied statistics, it is based upon a
fundamental misunderstanding of the nature of rational inference, and is
seldom if ever appropriate to the aims of scientific research. This is
not a particularly original view: traditional null-hypothesis procedure
has already been superseded in modern statistical theory by a variety of
more satisfactory inferential techniques. But the perceptual defenses
of psychologists are particularly efficient when dealing with matters of
methodology, and so the statistical folkways of a more primitive past
continue to dominate the local scene.

To examine the method in question in greater detail, and expose
some of the discomfitures to which it gives rise, let us begin with a
hypothetical case study.


A case study in null-
hypothesis procedure; or,
a quorum of embarrassments


Suppose that according to the theory of behavior, T₀, held by most
right-minded, respectable behaviorists, the extent to which a certain
behavioral manipulation M facilitates learning in a certain complex
learning situation C should be null. That is, if φ designates the degree
to which manipulation M facilitates the acquisition of habit H under
circumstances C, it follows from the orthodox theory T₀ that φ = 0.
Also suppose, however, that a few radicals have persistently advocated
an alternative theory T₁ which entails, among other things, that the
facilitation of H by M in circumstances C should be appreciably
greater than zero, the precise extent being dependent upon the values
of certain parameters in C. Finally, suppose that Igor Hopewell,
graduate student in psychology, has staked his dissertation hopes on
an experimental test of T₀ against T₁, on the basis of their differential
predictions about the value of φ.


compare their efficiency ipulation 
S’s who, under circumstances C, have not been exposed to mae ee 
M. The difference, d, between experimental and control S’s in 


[...] those customarily subsumed under the headings of
“individual differences” and “errors of measurement.” To curtail a
long mathematical story, it turns out that with suitable (possibly
justified) assumptions about [...]
uncontrolled variables, the manner in which they influence the
dependent variable, and [...]
of φ, is unbiasedly estimated by the square of another sample statistic,
s, computed from the data of the experiment. 

The import of these statistical considerations for Hopewell’s
dissertation, of course, is that he will not be permitted to reason in any
simple way from the observed d to a conclusion about the comparative
merits of T₀ and T₁. To conclude that T₀, rather than T₁, is correct,
he must argue that φ = 0, rather than φ > 0. But the observed d,
whatever its value, is logically compatible both with the hypothesis
that φ = 0 and the hypothesis that φ > 0. How then, can Hopewell
use his data to make a comparison of T₀ and T₁? As a well-trained
student, what he does, of course, is to divide d by s to obtain what,
under H₀, is a t statistic, consult a table of the t distributions under the
appropriate degrees-of-freedom, and announce his experiment as
disconfirming or supporting T₀, respectively, according to whether
or not the discrepancy between d and the zero value expected under
T₀ is “statistically significant”; that is, whether or not the observed
value of d/s falls outside of the interval between two extreme percentiles
(usually the 2.5th and 97.5th) of the t distribution with that df. If
asked by his dissertation committee to justify this behavior, Hopewell
would rationalize something like the following (the more honest
reply, that this is what he has been taught to do, not being considered
appropriate to such occasions):


In deciding whether or not T₀ is correct, I can make two types
of mistakes: I can reject T₀ when it is in fact correct [Type I error], or
I can accept T₀ when in fact it is false [Type II error]. As a scientist,
I have a professional obligation to be cautious, but a 5% chance of error
is not unduly risky. Now if all my statistical background assumptions
are correct, then, if it is really true that φ = 0 as T₀ says, there is only
one chance in 20 that my observed statistic d/s will be smaller than
t.025 or larger than t.975, where by the latter I mean, respectively, the
2.5th and 97.5th percentiles of the t distribution with the same
degrees-of-freedom as in my experiment. Therefore, if I reject T₀ when d/s is
smaller than t.025 or larger than t.975, and accept T₀ otherwise, there is
only a 5% chance that I will reject T₀ incorrectly.


If asked about his Type II error, and why he did not choose some other
rejection region, say between t.475 and t.525, which would yield the
same probability of Type I error, Hopewell should reply that although
he has no way to compute his probability of Type II error under the
assumptions traditionally authorized by null-hypothesis procedure,
it is presumably minimized by taking the rejection region at the extremes
of the t distribution.


1. s is here the estimate of the standard error of the difference in means, not
the estimate of the individual SD.

Let us suppose that for Hopewell’s data, d = 8.50, s = 5.00, and
df = 20. Then t.975 = 2.09 and the acceptance region for the null
hypothesis φ = 0 is −2.09 < d/s < 2.09, or −10.45 < d < 10.45.
Since d does fall within this region, standard null-hypothesis decision
procedure, which I shall henceforth abbreviate “NHD,” dictates that
the experiment is to be reported as supporting theory T₀. (Although
many persons would like to conceive NHD testing to authorize only
rejection of the hypothesis, not, in addition, its acceptance when the
test statistic fails to fall in the rejection region, if failure to reject were
not taken as grounds for acceptance, then NHD procedure would
involve no Type II error, and no justification would be given for taking
the rejection region at the extremes of the distribution, rather than in
its middle.) But even as Hopewell reaffirms T₀ in his dissertation, he
begins to feel uneasy. In fact, several disquieting thoughts occur to him:


1. Although his test statistic falls within the orthodox acceptance
region, a value this divergent from the expected zero should nonetheless
be encountered less than once in 10. To argue in favor of a hypothesis
on the basis of data ascribed a p value no greater than .10
(i.e., 10%) by that hypothesis certainly does not seem to be one of the
more impressive displays of scientific caution.


an actually be computed. Suppose the value 


way is ġ = 10.0. an taking 
= 0 as the null hypothesy, $ = 10.0. Then, rather than g 


od under T, of obtaining a test 
xpected 10.0 is a most satisfactory 
rs to Hopewell that had he chosen to 
ists by selecting @ = 10.0 as his 
de a strong argument in favor O 
atistical reasoning he has used tO 
hypothesis. That is, he could have
made an argument that persons partial to T₁ would regard as strong.
For behaviorists who are already convinced that T₀ is correct would
[...] that since T₀ is the dominant theory, only φ = 0 is a legitimate
null hypothesis. (And is it not strange that what constitutes a valid
argument should be dependent upon the majority opinion
about behavior theory?)


3. According to the NHD test of a hypothesis, only two possible
final outcomes of the experiment are recognized: either the hypothesis
is rejected or it is accepted. In Hopewell’s experiment, all possible
values of d/s between −2.09 and 2.09 have the same interpretive
significance, namely, indicating that φ = 0, while conversely, all
possible values of d/s greater than 2.09 are equally taken to signify
that φ ≠ 0. But Hopewell finds this disturbing, for of the various
possible values that d/s might have had, the significance of d/s = 1.70
for the comparative merits of T₀ and T₁ should surely be more similar
to that of, say, d/s = 2.10 than to that of, say, d/s = −1.70.


4. In somewhat similar vein, it also occurs to Hopewell that had he
opted for a somewhat riskier confidence level, say a Type I error of
10% rather than 5%, d/s would have fallen outside the region of
acceptance and T₀ would have been rejected. Now surely the degree to which
a datum corroborates or impugns a proposition should be
independent of the datum-assessor’s personal temerity. Yet according to
orthodox significance-test procedure, whether or not a given
experimental outcome supports or disconfirms the hypothesis in question
depends crucially upon the assessor’s tolerance for Type I risk.
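
The arithmetic of Hopewell's test, and the two-valued verdict that troubles him in point 3, can be sketched as follows. The figures d = 8.50, s = 5.00 and the critical value 2.09 are the ones given in the text; the function name and the verdict strings are illustrative assumptions.

```python
def nhd_verdict(t_ratio, critical_value):
    """Two-valued NHD outcome: accept inside the region, reject outside."""
    if -critical_value < t_ratio < critical_value:
        return "accept"
    return "reject"


d = 8.50          # observed difference between groups (from the text)
s = 5.00          # estimated standard error of the difference (from the text)
t_crit = 2.09     # tabled 97.5th percentile of t with 20 df (from the text)

t_ratio = d / s                                  # Hopewell's test statistic
acceptance_region_for_d = (-t_crit * s, t_crit * s)

# The three values of d/s compared in point 3.
verdicts = {value: nhd_verdict(value, t_crit) for value in (1.70, 2.10, -1.70)}
print(t_ratio, acceptance_region_for_d, verdicts)
```

The procedure gives 1.70 and −1.70 the same verdict and separates 1.70 from 2.10, although 1.70 lies far closer to 2.10 than to −1.70.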


Despite his inexperience, Igor Hopewell is a sound experimentalist
at heart, and the more he reflects on these statistics, the more dissatisfied
with his conclusions he becomes. So while the exigencies of graduate
circumstances and publication requirements urge that his dissertation
be written as a confirmation of T₀, he nonetheless resolves to keep an
open mind on the issue, even carrying out further research if opportunity
permits. And reading his experimental report, so of course would we:
what responsible scientist ever made up his mind about such a matter
on the basis of a single experiment? Yet in this obvious way we reveal
how little our actual inferential behavior corresponds to the statistical
procedure to which we pay lip-service. For if we did, in fact, accept or
reject the null hypothesis according to whether the sample statistic
falls in the acceptance or in the rejection region, then there would be no
replications of experimental designs, no multiplicity of experimental
approaches to an important hypothesis; a single experiment would,
by definition of the method, make up our mind about the hypothesis
in question. And the fact that in actual practice, a single finding
seldom even tempts us to such closure of judgment reveals how little
the conventional model of hypothesis testing fits our actual evaluative
behavior.


Decisions vs.
degrees of belief 


By now, it should be obvious that something is radically amiss with the
traditional NHD assessment of an experiment’s theoretical import.
Actually, one does not have to look far in order to find the trouble: it is
simply a basic misconception about the purpose of a scientific
experiment. The null-hypothesis significance test treats acceptance or
rejection of a hypothesis as though these were decisions one makes on
the basis of the experimental data, i.e., that we elect to adopt one
belief, rather than another, as a result o 
But the primary 


decisions, but to 


are voluntary 
acceptance or 
provide the basi 
such a decision ( 
further experien 


The situation, in other wor 


—which are supported by these data. 
a proposition is not an all-or-none 


Prospects, the higher the odds he will 


demand before betting. That is, the extent to which Smith accepts or
rejects the hypothesis that War Biscuit will win the fifth at Belmont is
an important determinant of his betting decisions for that race. 

Now, although a scientist’s data supply evidence for the conclusions 
he draws from them, only in the unlikely case where the conclusions 
are logically deducible from or logically incompatible with the data 
do the data warrant that the conclusions be entirely accepted or 
rejected. Thus, e.g., the fact that War Biscuit has won all 16 of his
previous starts is strong evidence in favor of his winning the fifth at 
Belmont, but by no means warrants the unreserved acceptance of this 
hypothesis. More generally, the data available confer upon the 
conclusions a certain appropriate degree of belief, and it is the inferen- 
tial task of the scientist to pass from the data of his experiment to 
whatever extent of belief these and other available information justify 
in the hypothesis under investigation. In particular, the proper 
inferential procedure is not (except in the deductive case) a matter of 
deciding to accept (without qualification) or reject (without
qualification) the hypothesis: even if adoption of a belief were a matter of
voluntary action (which it is not), neither such extremes of belief or
disbelief are appropriate to the data at hand. As an example of the
disastrous consequences of an inferential procedure which yields 
only two judgment values, acceptance and rejection, consider how sad 
the plight of Smith would be if, whenever weighing the prospects 
for a given race, he always worked himself into either supreme confi- 
dence or utter disbelief that a certain horse will win. Smith would 
rapidly impoverish himself by accepting excessively low odds on 
horses he is certain will win, and failing to accept highly favorable odds 
on horses he is sure will lose. In fact, Smith’s two judgment values need
not be extreme acceptance and rejection in order for his inferential
procedure to be maladaptive. All that is required is that the degrees
of belief arrived at be in general inappropriate to the likelihood
conferred on the hypothesis by the data.
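
Smith's predicament can be put in numbers. The sketch below assumes a unit stake, odds quoted as profit per unit staked, and a hypothetical win probability of 0.8; none of these figures are in the text. A bettor is modeled as taking a bet whenever it looks favorable under his own degree of belief, and the expected gain is then computed under the probability actually conferred by the evidence.

```python
def expected_gain(win_probability, odds):
    """Expected profit on a unit stake: win `odds`, or lose the stake."""
    return win_probability * odds - (1.0 - win_probability)


def takes_bet(belief, odds):
    """Bet only if the expected gain looks positive under one's belief."""
    return expected_gain(belief, odds) > 0.0


true_probability = 0.8   # hypothetical probability given the evidence
low_odds = 0.1           # win 0.1 per unit staked

# A bettor with supreme confidence (belief 1.0) accepts these low odds ...
dogmatic_accepts = takes_bet(1.0, low_odds)
# ... a bettor whose belief matches the probability declines them ...
calibrated_accepts = takes_bet(true_probability, low_odds)
# ... because under that probability the bet loses money on average.
gain_at_low_odds = expected_gain(true_probability, low_odds)

# A bettor in utter disbelief (belief 0.0) refuses even favorable odds.
high_odds = 1.0
skeptic_accepts = takes_bet(0.0, high_odds)
gain_at_high_odds = expected_gain(true_probability, high_odds)

print(dogmatic_accepts, calibrated_accepts, gain_at_low_odds)
print(skeptic_accepts, gain_at_high_odds)
```

Under these assumed figures the all-or-none bettor errs in both directions: certainty leads him into a bet with negative expected gain, and disbelief keeps him out of one with positive expected gain.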

Now, the notion of “degree of belief appropriate to the data at
hand” has an unpleasantly vague, subjective feel about it which makes
it unpalatable for inclusion in a formalized theory of inference.
Fortunately, a little reflection about this phrase reveals it to be
intimately connected with another concept relating conclusion to
evidence which, though likewise in serious need of conceptual
clarification, has the virtues both of intellectual respectability and statistical
familiarity. I refer, of course, to the likelihood, or probability, conferred
upon a hypothesis by available evidence. Why should not Smith feel
certain, in view of the data available, that War Biscuit will win the
fifth at Belmont? Because it is not certain that War Biscuit will win.
More generally, what determines how strongly we should accept or
reject a proposition is the probability given to this hypothesis by the
information at hand. For while our voluntary actions (i.e., decisions) 
are determined by our intensities of belief in the relevant propositions, 
not by their actual probabilities, expected utility is maximized when 
the cognitive weights given to potential but not yet known-for-certain 
pay-off events are represented in the decision procedure by the 
probabilities of these events. We may thus relinquish the concept of 
“appropriate degree of belief” in favor of “probability of the hypothe- 
sis,” and our earlier contention about the nature of data-processing may 
be rephrased to say that the proper inferential task of the experimental 
scientist is not a simple acceptance or reject 


In brief, what is bein
to prescribe actions but


The methodological status
of the null-hypothesis
significance test


The preceding arguments have, in one form or another, raised several
doubts about the appropriateness of conventional significance-test
decision procedure for the aims it is supposed to achieve. It is now
time to bring these charges together in an explicit bill of indictment.


1. The null-hypothesis significance test treats “acceptance” or
“rejection” of a hypothesis as though these were decisions one makes.
But a hypothesis is not something, like a piece of pie offered for dessert,
which can be accepted or rejected by a voluntary physical action.
Acceptance or rejection of a hypothesis is a cognitive process, a degree
of believing or disbelieving which, if rational, is not a matter of choice
but determined solely by how likely it is, given the evidence, that the
hypothesis is true.


2. It might be argued that the NHD test may nonetheless be regarded
as a legitimate decision procedure if we translate “acceptance (rejection)
of the hypothesis” as meaning “acting as though the hypothesis were
true (false).” And to be sure, there are many occasions on which one
must base a course of action on the credibility of a scientific hypothesis.
Should these data be published? Should I devote my research resources
to and become identified professionally with this theory? Can we
test this new Z bomb without exterminating all life on earth? But
such a move to salvage the traditional procedure only raises two
further objections. (a) While the scientist—i.e., the person—must
indeed make decisions, his science is a systematized body of (probable)
knowledge, not an accumulation of decisions. The end product of a
scientific investigation is a degree of confidence in some set of propositions,
which then constitutes a basis for decisions. (b) Decision
theory shows the NHD test to be woefully inadequate as a decision
procedure. In order to decide most effectively when or when not to
act as though a hypothesis is correct, one must know both the probability
of the hypothesis under the data available and the utilities of
the various decision outcomes (i.e., the values of accepting the hypothesis
when it is true, of accepting it when it is false, of rejecting it when
it is true, and of rejecting it when it is false). But traditional NHD
procedure pays no attention to utilities at all, and considers the
probability of the hypothesis, given the data—i.e., the inverse probability—
only in the most rudimentary way (by taking the rejection
region at the extremes of the distribution rather than in its middle).




Failure of the traditional significance test to deal with inverse proba- 
bilities invalidates it not only as a method of rational inference, but 
also as a useful decision procedure. 
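
The dependence of a rational decision on both the probability of the hypothesis and the utilities of the four outcomes can be illustrated with a short sketch. The Python below is an illustration only: the function names and every pay-off figure are assumptions introduced here, not values taken from the text.

```python
def expected_utilities(p_true, u_accept_true, u_accept_false,
                       u_reject_true, u_reject_false):
    """Expected utility of acting as though H is true, and as though it is false.

    p_true is the probability of the hypothesis given the data (the inverse
    probability). The four u_* arguments are the values of the four decision
    outcomes. All figures used below are hypothetical.
    """
    eu_accept = p_true * u_accept_true + (1 - p_true) * u_accept_false
    eu_reject = p_true * u_reject_true + (1 - p_true) * u_reject_false
    return eu_accept, eu_reject


def best_action(p_true, **utilities):
    """Return 'accept' or 'reject', whichever has the larger expected utility."""
    eu_accept, eu_reject = expected_utilities(p_true, **utilities)
    return "accept" if eu_accept >= eu_reject else "reject"


# Hypothetical pay-offs: a large gain for correctly acting on a true hypothesis.
payoffs = dict(u_accept_true=10.0, u_accept_false=-1.0,
               u_reject_true=-2.0, u_reject_false=0.0)

for p in (0.05, 0.30, 0.90):
    print(p, best_action(p, **payoffs))
```

With these assumed pay-offs, acting as though the hypothesis is true is the better course even when its probability is only .30, which a procedure that ignores utilities could not reveal.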


3. The traditional NHD test unrealistically limits the significance of 
an experimental outcome to a mere two alternatives, confirmation 
or disconfirmation of the null hypothesis. Moreover, the transition 
from confirmation to disconfirmation as a function of the data is 
discontinuous—an arbitrarily small difference in the value of the test 
statistic can change its significance from confirmatory to disconfirma- 
tory. Finally, the point at which this transition occurs is entirely 
gratuitous. There is absolutely no reason (at least provided by the
method) why the point of statistical “significance” should be set at the
95% level, rather than, say, the 94% or 96% level. Nor does the fact
that we sometimes select a 99% level of significance, rather than the
usual 95% level, mitigate this objection—one is as arbitrary as the
other.
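
The discontinuity described in this objection is easy to exhibit numerically. The sketch below is a minimal illustration, assuming a two-tailed test on a standard normal statistic; the function names and the two z values are hypothetical choices made here.

```python
from statistics import NormalDist


def two_tailed_p(z):
    """Two-tailed p value of a standard normal test statistic z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def nhd_verdict(z, alpha=0.05):
    """The all-or-none NHD verdict at significance level alpha."""
    return "reject" if two_tailed_p(z) < alpha else "accept"


# Two nearly identical outcomes fall on opposite sides of the .05 line.
for z in (1.95, 1.97):
    print(z, round(two_tailed_p(z), 4), nhd_verdict(z))
```

The two p values differ by less than .003, yet the verdict changes from acceptance to rejection; shifting the cut-off to another level would move the break but not remove it.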


hypothesis is held innocent unless proved guilty, while any
alternative is held guilty until no choice remains but to judge it
innocent. What is objectionable here is not that some hypotheses are
held more resistant to experimental extinction than others, but that
the differential weighing is an all-or-none side effect of a personal
choice, and especially, that the method necessitates one hypothesis
being favored over all the others. In the classical theory of inverse
probability,
on the other hand, all hypotheses are treated on a par, each receiving a 
weight (i.e., its “a priori” Probability) which reflects the credibility of 
that hypothesis on grounds other than the data being assessed.

which p = .06, even though the point of “significance” has been set
at p = .05? In fact, the reader may well feel undisturbed by the
charges raised here against traditional NHD procedure precisely
because, without perhaps realizing it, he has never taken the method
seriously anyway. Paradoxically, it is often the most firmly
institutionalized tenet of faith that is most susceptible to untroubled
disregard—in our culture, one must early learn to live with sacrosanct
verbal formulas whose import for practical behavior is seldom heeded.
I suspect that the primary reasons why null-hypothesis significance
testing has attained its current ritualistic status are (a) the surcease
of methodological insecurity afforded by having an inferential
algorithm on the books, and (b) the fact that a by-product of the
algorithm is so useful, and its end product so obviously inappropriate,
that the latter can be ignored without even noticing that this has,
in fact, been done. What has given the traditional method its spurious
feel of usefulness is that the first, and by far most laborious, step in
the procedure, namely, estimating the probability of the experimental
outcome under the assumption that a certain hypothesis is correct, is
also a crucial first step toward what one is genuinely concerned with,
namely, an idea of the likelihood of that hypothesis, given this
experimental outcome. Having obtained this most valuable statistical
information under pretext of carrying through a conventional
significance test, it is then tempting, though of course quite
inappropriate, to heap honor and gratitude upon the method while
overlooking that its actual result, namely, a decision to accept or
reject, is not used at all.


Toward a more realistic
appraisal of
experimental data


So far, my arguments have tended to be aggressively critical—one can
hardly avoid polemics when butchering sacred cows. But my purpose is
not just to be contentious, but to help clear the way for more realistic
techniques of data assessment, and the time has now arrived for some
constructive suggestions. Little of what follows pretends to any
originality; I merely urge that ongoing developments along these
lines should receive maximal encouragement.

For the statistical theoretician, the following problems would
seem to be eminently worthy of research:


1. Of supreme importance for the theory of probability is analysis of
what we mean by a proposition’s “probability,” relative to the evidence
provided. Most serious students of the philosophical foundations of
probability and statistics agree (cf. Braithwaite) that the probability
of a proposition (e.g., the probability that the General
Theory of Relativity is correct) does not, prima facie, seem to be the
same sort of thing as the probability of an event (e.g., the
probability of getting a head when this coin is tossed). Do the statistical
concepts and formulas which have been developed for probabilities
of the latter kind also apply to hypothesis likelihoods? In particular,
are the probabilities of hypotheses quantifiable at all, and for the
theory of inverse probability, do Bayes’ theorem and its probability-
density refinements apply to hypothesis probabilities? These and
similar questions are urgently in need of clarification.


2. If we are willing to assume that Bayes’ theorem, or something like
it, holds for hypothesis probabilities, there is much that can be done to
develop the classical theory of inverse probability. While computation
of inverse probabilities turns essentially upon the parametric a priori
probability function, which states the probability of each alternative
hypothesis in the set under consideration prior to the outcome of the
experiment, it should be possible to develop theorems which are
invariant over important sub-classes of


and the “a posteriori” 


alternative hypotheses after the e 
difference in “information,” should be a potentially fruitful source’ 
of concepts with which to e 

“efficiency” of various statisti 
through repeated experim icl 
seems to me to have considerable import, though not one about which 
I am sanguine, is whether i 


be extended to hypothesis. 


y of E is q, 
if we are to 
background a 


necessary, e.g., 
attached to the 


a 
che ssumptions which always accompany 
statistical analysis, ct 

My suggestions for applied statistical analysis turn on the fact
that while what is desired

about this will exist only as a subjective feel, differing from one person
to the next, about the credibilities of the various hypotheses.


3. Whenever possible, the basic statistical report should be in the
form of a confidence interval. Briefly, a confidence interval is a subset
of the alternative hypotheses computed from the experimental data
in such a way that for a selected confidence level α, the probability
that the true hypothesis is included in a set so obtained is α. Typically,
an α-level confidence interval consists of those hypotheses under
which the p value for the experimental outcome is larger than 1 − α
(a feature of confidence intervals which is sometimes confused with
their definition), in which case the confidence-interval report is similar
to a simultaneous null-hypothesis significance test of each hypothesis
in the total set of alternatives. Confidence intervals are the closest we
can at present come to quantitative assessment of hypothesis-probabilities
(see technical note, below), and are currently our most
effective way to eliminate hypotheses from practical consideration—if
we choose to act as though none of the hypotheses not included
in a 95% confidence interval are correct, we stand only a 5% chance
of error. (Note, moreover, that this probability of error pertains to the
incorrect simultaneous “rejection” of a major part of the total set of
alternative hypotheses, not just to the incorrect rejection of one as in
the NHD method, and is a total likelihood of error, not just of Type I
error.) The confidence interval is also a simple and effective way to
convey that all-important statistical datum, the conditional probability
(or probability density) function—i.e., the probability (probability
density) of the observed outcome under each alternative hypothesis—
since for a given kind of observed statistic and method of
confidence-interval determination, there will be a fixed relation between the
parameters of the confidence interval and those of the conditional
probability (probability density) function, with the end points of the
confidence interval typically marking the points at which the conditional
probability (probability density) function sinks below a certain
small value related to the parameter α. The confidence-interval
report is not biased toward some favored hypothesis, as is the
null-hypothesis significance test, but makes an impartial simultaneous
evaluation of all the alternatives under consideration. Nor does the
confidence interval involve an arbitrary decision as does the NHD test.
Although one person may prefer to report, say, 95% confidence
intervals while another favors 99% confidence intervals, there is no
conflict here, for these are simply two ways to convey the same
information. An experimental report can, with complete consistency
and some benefit, simultaneously present several confidence intervals
for the parameter being estimated. On the other hand, different
choices of significance level in the NHD method is a clash of
incompatible decisions, as attested by the fact that an NHD analysis which
simultaneously presented two different significance levels would
yield a logically inconsistent conclusion when the observed statistic
has a value in the acceptance region of one significance level and in the
rejection region of the other.
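
The remark that an α-level confidence interval typically consists of those hypotheses under which the p value of the outcome exceeds 1 − α can be checked with a small sketch. The example below assumes the simplest textbook case, a sample mean from a normal population with known standard deviation; the function names, the grid of candidate means, and all figures are hypothetical.

```python
import math
from statistics import NormalDist


def p_value(sample_mean, mu0, sigma, n):
    """Two-tailed p value of the sample mean under the hypothesis mu = mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def interval_by_inversion(sample_mean, sigma, n, level, grid):
    """Keep every candidate mean whose p value exceeds 1 - level."""
    kept = [mu for mu in grid if p_value(sample_mean, mu, sigma, n) > 1 - level]
    return min(kept), max(kept)


def interval_closed_form(sample_mean, sigma, n, level):
    """The usual interval: mean plus or minus z times the standard error."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical data: mean 102 from n = 100 observations, sigma = 15.
grid = [90 + 0.01 * i for i in range(2501)]  # candidate means 90.00 .. 115.00
for level in (0.95, 0.99):
    print(level,
          interval_by_inversion(102, 15, 100, level, grid),
          interval_closed_form(102, 15, 100, level))
```

Under these assumptions the two constructions agree to within the grid spacing, and the 99% interval simply contains the 95% interval: the two reports convey the same information at different levels rather than conflicting.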


Technical Note: One of the more important problems now con- 
fronting theoretical statistics is exploration and clarification of the 
relationships among inverse probabilities derived from confidence- 
interval theory, fiducial-probability theory (a special case of the former 
in which the estimator is a sufficient statistic), and classical (i.e. 
Bayes’) inverse-probability theory. While the interpretation of 
confidence intervals is tricky, it would be a mistake to conclude, as the
cautionary remarks usually accompanying discussions of confidence
intervals sometimes seem to imply, that the confidence-level α of a
given confidence interval I should not really be construed as a
probability that the true hypothesis, H, belongs to the set I. Nonetheless, if I
is an α-level confidence interval, the probability that H belongs to I as
computed by Bayes’ theorem given an a priori probability distribution
will, in general, not be equal to α, nor is the difference necessarily a
small one—it is easy to construct examples where the a posteriori
probability that H belongs to I is either 0 or 1. Obviously, when
different techniques for computing the probability that H belongs to
I yield such different answers, a reconciliation is demanded. In this
instance, however, the apparent disagreement is largely if not entirely
spurious, resulting from differences in the evidence relative to which the
probability that H belongs to I is computed. And if this is, in fact, the
correct explanation, then fiducial probability furnishes a partial
solution to an outstanding difficulty in the Bayes’ approach. A major
weakness of the latter has always been the problem of what to assume
for the a priori distribution when no pre-experimental information is
available other than that supporting the background assumptions
which delimit the set of hypotheses under consideration. The traditional
assumption (made hesitantly by Bayes, less hesitantly by his
successors) has been the “principle of insufficient reason,” namely,
that given no knowledge at all, all alternatives are equally likely.
But not only is it difficult to give a convincing argument for this
assumption, it does not even yield a unique a priori probability
distribution over a continuum of alternative hypotheses, since there
are many ways to express such a continuous set, and what is an
equi-likelihood a priori distribution under one of these does not necessarily
transform into the same under another. Now, a fiducial probability
distribution determined over a set of alternative hypotheses by an 
experimental observation is a measure of the likelihoods of these 
hypotheses relative to all the information contained in the experimental 
data, but based on no pre-experimental information beyond the 
background assumptions restricting the possibilities to this particular 
set of hypotheses. Therefore, it seems reasonable to postulate that the 
no-knowledge a priori distribution in classical inverse probability 
theory should be that distribution which, when experimental data 
capable of yielding a fiducial argument are now given, results in an a 
posteriori distribution identical with the corresponding fiducial 


distribution. 


4. While a confidence-interval analysis treats all the alternative 
hypotheses with glacial impartiality, it nonetheless frequently occurs 
that our interest is focused on a certain selection from the set of 
possibilities. In such case, the statistical analysis should also report, 
when computable, the precise p value of the experimental outcome, 
or better, though less familiarly, the probability density at that outcome, 
under each of the major hypotheses; for these figures will permit an 
immediate judgment as to which of the hypotheses is most favored by 
the data. In fact, an even more interesting assessment of the
post-experimental credibilities of the hypotheses is then possible through
use of “likelihood ratios” if one is willing to put his pre-experimental
feelings about their relative likelihoods into a quantitative estimate.
For let Pr(H, d), Pr(d, H), and Pr(H) be, respectively, the probability
of a hypothesis H in light of the experimental data d (added to the
information already available), the probability of data d under
hypothesis H, and the pre-experimental (i.e., a priori) probability of H.
Then for two alternative hypotheses H₀ and H₁, it follows by classical
theory that

$$\frac{\Pr(H_0, d)}{\Pr(H_1, d)} = \frac{\Pr(H_0)}{\Pr(H_1)} \times \frac{\Pr(d, H_0)}{\Pr(d, H_1)} \qquad (1)^{2}$$

Therefore, if the experimental report includes the probability (or
probability density) of the data under H₀ and H₁, respectively, and
its reader can quantify his feelings about the relative pre-experimental
merits of H₀ and H₁ (i.e., Pr(H₀)/Pr(H₁)), he can then determine the
judgment he should make about the relative merits of H₀ and H₁ in
light of these new data.


2. When the numbers of alternative hypotheses and possible experimental
outcomes are transfinite, Pr(d, H) = Pr(H, d) = Pr(H) = 0 in most cases.
If so, the probability ratios in Formula 1 are replaced with the corresponding
probability-density ratios. It should be mentioned that this formula
idealistically presupposes there to be no doubt about the correctness of the
background statistical assumptions.
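
Formula 1 lends itself to a brief worked sketch. The Python below assumes, purely for illustration, two point hypotheses about a population mean and a normally distributed sample mean with known standard error; the function names and all numbers are hypothetical, and probability densities stand in for probabilities, as the footnote to Formula 1 allows.

```python
import math
from statistics import NormalDist


def likelihood_ratio(observed, mu0, mu1, std_error):
    """Density of the observed outcome under H0 divided by that under H1."""
    density_h0 = NormalDist(mu0, std_error).pdf(observed)
    density_h1 = NormalDist(mu1, std_error).pdf(observed)
    return density_h0 / density_h1


def posterior_odds(prior_odds, observed, mu0, mu1, std_error):
    """Formula 1: posterior odds = prior odds times likelihood ratio."""
    return prior_odds * likelihood_ratio(observed, mu0, mu1, std_error)


# Hypothetical report: observed mean 102, standard error 1.5,
# H0: mean = 100 against H1: mean = 105.
ratio = likelihood_ratio(102, 100, 105, 1.5)
print(round(ratio, 3))                                   # data favor H0
print(round(posterior_odds(1.0, 102, 100, 105, 1.5), 3))  # even prior odds
print(round(posterior_odds(0.1, 102, 100, 105, 1.5), 3))  # prior odds 1 to 10
```

The same reported likelihood ratio thus serves readers with different pre-experimental opinions: each multiplies it by his own prior odds to reach his own post-experimental judgment.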


5. Finally, experimental journals should allow the researcher much
more latitude in publishing his statistics in whichever form seems
most insightful, especially those forms developed by the modern
theory of estimates. In particular, the stranglehold that conventional
null-hypothesis significance testing has clamped on publication
standards must be broken. Currently justifiable inferential algorithm
carries us only through computation of conditional probabilities;
from there, it is for everyman’s clinical judgment and methodological
conscience to see him through to a final appraisal. Insistence that
published data must have the biases of the NHD method built into the
report, thus seducing the unwary reader into a perhaps highly
inappropriate interpretation of the data, is a professional disservice of the first
magnitude.


Summary


The traditional null-hypothesis significance-test method, more
appropriately called “null-hypothesis decision [NHD] procedure,” of
statistical analysis is here vigorously excoriated for its inappropriateness
as a method of inference. While a number of serious objections
to the method are raised, its most basic error lies in mistaking the aim
of a scientific investigation to be a decision, rather than a cognitive
evaluation of propositions. It is further argued that the proper application
of statistics to scientific inference is irrevocably committed to
extensive consideration of inverse probabilities, and to further this
end, certain suggestions are offered, both for the development of


statistical theory and for more illuminating application of statistical
analysis to empirical data.


That which we might identify as the “crisis of psychology” is closely
related to what Hogben (1958) has called the “crisis in statistical
theory.” The vast majority of investigations which pass for research
in the field of psychology today entail the use of statistical tests of
significance. Most characteristically, when a psychologist finds a
problem he wishes to investigate he converts his intuitions and
hypotheses into procedures which will yield a test of significance, and
will characteristically allow the result of the test of significance to
bear the essential responsibility for the conclusions which he will
draw.

The major point of this paper is that the test of significance does
not provide the information concerning psychological phenomena
characteristically attributed to it; and that, furthermore, a great deal of
mischief has been associated with its use. What will be said in this
paper is hardly original. It is, in a certain sense, what “everybody
knows.” To say it “out loud” is, as it were, to assume the role of the
child who pointed out that the emperor was really outfitted only 
in his underwear. Little of that which is contained in this paper is not 
already available in the literature, and the literature will be cited. 
Lest what is being said in this paper be misunderstood, some 
clarification needs to be made at the outset. It is not a blanket criticism 
of statistics, mathematics, or, for that matter, even the test of significance 
when it can be appropriately used. The argument is rather that the 
test of significance has been carrying too much of the burden of 
scientific inference. Wise and ingenious investigators can find their 
way to reasonable conclusions from data because and in spite of their 
procedures. Too often, however, even wise and ingenious investigators, 
for varieties of reasons not the least of which are the editorial policies 
of our major psychological journals, which we will discuss below, tend 
to credit the test of significance with properties it does not have. 


Logic of the test 
of significance 


The test of significance has as its aim obtaining information concerning 
a characteristic of a population which is itself not directly observable, 
whether for practical or more intrinsic reasons. What is observable is 
the sample. The work assigned to the test of significance is that of 
aiding in making inferences from the observed sample to the un- 
observed population. 

The critical assumption involved in testing significance is that, if
the experiment is conducted properly, the characteristics of the
population have a designably determinative influence on samples drawn
from it, that, for example, the mean of a population has a determinative
influence on the mean of a sample drawn from it. Thus if P, the
population characteristic, has a determinative influence on S, the sample
characteristic, then there is some license for making inferences from
S to P.

If the determinative influence of P on S could be put in the form of
simple logical implication, that P implies S, the problem would be
quite simple. For, then we would have the simple situation: if P
implies S, and if S is false, P is false. There are some limited instances in
which this logic applies directly in sampling. For example, if the
range of values in the population is between 3 and 9 (P), then the range
of any sample must be between 3 and 9 (S). Should we find
a value in a sample of, say, 10, it would mean that S is false; and we
could assert that P is false.

It is clear from this, however, that, strictly speaking, one can only
go from the denial of S to the denial of P; and not from the assertion
of S to the assertion of P. It is within this context of simple logical 
implication that the Fisher school of statisticians have made important 
contributions—and it is extremely important to recognize this as the 
context. 

In contrast, approaches based on the theorem of Bayes (Bakan, 
1953, 1956; Edwards, Lindman, and Savage, 1963; Keynes, 1948; 
Savage, 1954; Schlaifer, 1959) would allow inferences to P from S 
even when S is not denied, as S adding something to the credibility of 
P when S is found to be the case. One of the most viable alternatives to 
the use of the test of significance involves the theorem of Bayes; and 
the paper by Edwards et al. (1963) is particularly directed to the
attention of psychologists for use in psychological research.

The notion of the null hypothesis¹ promoted by Fisher constituted
an advance within this context of simple logical implication. It
allowed experimenters to set up a null hypothesis complementary to the
hypothesis that the investigator was interested in, and provided him
with a way of positively confirming his hypothesis. Thus, for example,
the investigator might have the hypothesis that, say, normals differ
from schizophrenics. He would then set up the null hypothesis that
the means in the population of all normals and all schizophrenics were
equal. Thus, the rejection of the null hypothesis constituted a way of
asserting that the means of the populations of normals and
schizophrenics were different, a completely reasonable device whereby to
affirm a logical antecedent.

The model of simple logical implication for making inferences 
from S to P has another difficulty which the Fisher approach sought 
to overcome. This is that it is rarely meaningful to set up any simple 
“P implies S” model for parameters that we are interested in. In the 
case of the mean, for example, it is rather that P has a determinative 
influence on the frequency of any specific S. But one experiment does 
not provide many values of S to allow the study of their frequencies. 

It gives us only one value of S. The sampling distribution is conceived
which specifies the relative frequencies of all possible values of S.
Then, with the help of an adopted level of significance, we could, in
effect, say that S was false: that is, any S which fell in a region
whose relative frequency under the null hypothesis was, say,
5% would be considered false. If such an S actually occurred, we would
be in a position to declare P to be false, still within the model of simple
logical implication.


1. There is some confusion in the literature concerning the meaning of the
term null hypothesis. Fisher used the term to designate any exact hypothesis
that we might be interested in disproving, and “null” was used in the sense
of that which is to be nullified (cf., e.g., Berkson, 1942). It has, however,
also been used to indicate a parameter of zero (cf., e.g., Lindquist, 1940,
p. 15), that the difference between the population means is zero, or the
correlation coefficient in the population is zero, the difference in proportions
in the population is zero, etc. Since both meanings are usually intended in
psychological research, it causes little difficulty.


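It is important to recognize that one of the essential features of the
Fisher approach is what may be called the once-ness of the experiment;
the inference model takes as critical that the experiment has been
conducted once. If an S which has a low probability under the null
hypothesis actually occurs, it is taken that the null hypothesis is false.
As Fisher (1947, p. 14) put it, why should the theoretically rare event
happen to “us”? If it does occur, we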


in a hypothetical po 
manner, but only or 
the probability of fa 
exactly that value 
Replication of thee 
unless the replicati 
probabilities of th 


2. I playfully once c 
that every coin ha: 
that if the spirit is 


onducted the followin 
s associated with it a “ 
implored properly, 
rit. I thus invoked 


it came up head. I did it again, it came up head again. Idi
inference model itself. Lest he be done a complete injustice, it should be
pointed out that he did say, “In relation to the test of significance,
we may say that a phenomenon is experimentally demonstrable when
we know how to conduct an experiment which will rarely fail to give
us statistically significant results [1947, p. 14].” However, although
Fisher “himself” believes this, it is not built into the inference model.³


Difficulties of 
the null hypothesis 


As already indicated, research workers in the field of psychology place 
a heavy burden on the test of significance. Let us consider some of the 
difficulties associated with the null hypothesis. 


1. The a priori reasons
for believing that the null
hypothesis is generally
false anyway
One of the common experiences of research workers is the very high
frequency with which significant results are obtained with large samples.
Some years ago, the author had occasion to run a number of tests of
significance on a battery of tests collected on about 60,000 subjects
from all over the United States. Every test came out significant.
Dividing the cards by such arbitrary criteria as east vs. west of the
Mississippi River, Maine vs. the rest of the country, North vs. South, etc.,
all produced significant differences in means. In some instances
the differences in the sample means were quite small, but nonetheless
the p values were all very low. Nunnally (1960) has reported a similar
experience involving correlation coefficients on 700 subjects.
Berkson (1938) made the observation almost 30 years ago in connection
with chi-square:

I believe that an observant statistician who has had any considerable
experience with applying the chi-square test repeatedly will agree with
my statement that, as a matter of observation, when the numbers in the
data are quite large, the P’s tend to come out small. Having observed this,
and on reflection, I make the following dogmatic statement, referring
for illustration to the normal curve: “If the normal curve is fitted to
a body of data representing any real observations whatever of quantities
in the physical world, then if the number of observations is extremely
large—for instance, on an order of 200,000—the chi-square P will be
small beyond any usual limit of significance.”

This dogmatic statement is made on the basis of an extrapolation of
the observation referred to and can also be defended as a prediction from
a priori considerations. For we may assume that it is practically certain
that any series of real observations does not actually follow a normal
curve with absolute exactitude in all respects, and no matter how small
the discrepancy between the normal curve and the true curve of observations,
the chi-square P will be small if the sample has a sufficiently large
number of observations in it.

If this be so, then we have something here that is apt to trouble the
conscience of a reflective statistician using the chi-square test. For I
suppose it would be agreed by statisticians that a large sample is always
better than a small sample. If, then, we know in advance the P that will
result from an application of a chi-square test to a large sample, there
would seem to be no use in doing it on a smaller one. But since the result
of the former test is known, it is no test at all [pp. 526-527].


3. Possibly not even this criterion is sound. It may be that a number of
statistically significant results which are borderline “speak for the null
hypothesis rather than against it” [Edwards et al., 1963]. If the null
hypothesis were really false, then with an increase in the number of instances
in which it can be rejected, there should be some substantial proportion of
more dramatic rejections rather than borderline rejections.


As one group of authors has put it, “in typical applications . . . the null
hypothesis . . . is known by all concerned to be false from the outset.”
The fact of the matter is that there is really no good reason to expect
the null hypothesis to be true in any population. Why should the mean,
say, of all scores east of the Mississippi be identical to all scores west
of the Mississippi? Why should any correlation coefficient be exactly
.00 in the population?

The reason why the null hypothesis is characteristically rejected
with large samples was made patent by the theoretical work of
Neyman and Pearson (1933). The probability of rejecting the null
hypothesis is a function of five factors: whether the test is one- or
two-tailed, the level of significance, the standard deviation, the amount
of deviation from the null hypothesis, and the number of observations.
The choice of a one- or two-tailed test is the investigator’s; the level of
significance is also based on the choice of the investigator; the standard
deviation is a given of the situation, and is characteristically reasonably
well estimated; the deviation from the null hypothesis is what is
unknown; and the choice of the number of cases in psychological 
work is characteristically arbitrary or expeditious. Should there be any 
deviation from the null hypothesis in the population, no matter how 
small—and we have little doubt that such a deviation usually exists—a 
sufficiently large number of observations will lead to the rejection of the 
null hypothesis. As Nunnally (1960) put it, 


if the null hypothesis is not rejected, it is usually because the N is too 
small. If enough data are gathered, the hypothesis will generally be 
rejected. If rejection of the null hypothesis were the real intention in 
psychological experiments, there usually would be no need to gather
data [p. 643]. 
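
The dependence of rejection on the number of observations can be made concrete with a small calculation. What follows is only a sketch under stated assumptions, not part of the original argument: it assumes a one-sample z test with a known standard deviation (a simplification of the t test usually run in practice), and the function name `rejection_probability` and the illustrative deviation of .01 standard deviations are hypothetical choices.

```python
from statistics import NormalDist


def rejection_probability(deviation, sd, n, alpha=0.05, two_tailed=True):
    """Probability of rejecting the sharp null (population mean == 0).

    Assumes a one-sample z test with known standard deviation `sd`.
    The five arguments are the five factors named in the text: the
    deviation from null, the standard deviation, the number of
    observations, the level of significance, and one- vs. two-tailed.
    """
    std_normal = NormalDist()
    # Expected value of the z statistic when the true mean is `deviation`.
    shift = deviation * n ** 0.5 / sd
    if two_tailed:
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        upper = 1 - std_normal.cdf(z_crit - shift)
        lower = std_normal.cdf(-z_crit - shift)
        return upper + lower
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - shift)


if __name__ == "__main__":
    # A deviation of one hundredth of a standard deviation (hypothetical).
    for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(n, round(rejection_probability(0.01, 1.0, n), 4))
```

With the deviation held fixed at a trivially small value, the printed probability starts near the level of significance and climbs toward 1 as n grows, which is the point of the Nunnally quotation.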


2. Type I error and 
publication practices 


The Type I error is the error of rejecting the null hypothesis when it is 
indeed true, and its probability is the level of significance. Later in this 
paper we will discuss the distinction between sharp and loose null 
hypotheses. The sharp null hypothesis, which we have been discussing,
is an exact value for the null hypothesis as, for example, the difference
between population means being precisely zero. A loose null hypothesis
is one in which it is conceived of as being around null. Sharp null
hypotheses, as we have indicated, rarely exist in nature. Assuming that
loose null hypotheses are not rare, and that their testing may make
sense under some circumstances, let us consider the role of the
publication practices of our journals in their connection.
It is the practice of editors of our psychological journals, receiving
many more papers than they can possibly publish, to use the magnitude
of the p values reported as one criterion for acceptance or rejection of a
study. For example, consider the following statement made by Arthur
Melton (1962) on completing 12 years as editor of the Journal of
Experimental Psychology, certainly one of the most prestigious and
scientifically meticulous psychological journals. In enumerating the
criteria by which articles were evaluated, he said:


The next step in the assessment of an article involved a judgment
with respect to the confidence to be placed in the findings—confidence
that the results of the experiment would be repeatable under the conditions
described. In editing the Journal there has been a strong reluctance to
accept and publish results related to the principal concern of the research
when those results were significant at the .05 level, whether by one- or
two-tailed test. This has not implied a slavish worship of the .01 level, as
some critics may have implied. Rather, it reflects a belief that it is the
responsibility of the investigator in a science to reveal his effect in such
a way that no reasonable man would be in a position to discredit the
results by saying that they were the product of the way the ball bounces
[pp. 553-554].


His clearly expressed opinion that nonsignificant results should
not take up the space of the journals is shared by most editors of
psychological journals. It is important to point out that I am not
advocating a change in policy in this connection. In the total research
enterprise where so much of the load for making inferences concerning
the nature of phenomena is carried by the test of significance, the
editors can do little else. The point is rather that the situation in regard
to publication makes manifest the difficulties in connection with the
overemphasis on the test of significance as a principal basis for making
inferences.

McNemar (1960) has rightly pointed out that not only do journal
editors reject papers in which the results are not significant, but that
papers in which significance has not been obtained are not submitted,
that investigators select out their significant findings for inclusion in
their reports, and that theory-oriented research workers tend to
discard data which do not work to confirm their theories. The result
of all of this is that “published results are more likely to involve false
rejection of null hypotheses than indicated by the stated levels of
significance [p. 300],” that is, published results which are significant
may well have Type I errors in them far in excess of, say, the 5% which
we may allow ourselves.
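
The arithmetic behind this selection effect can be sketched briefly. The sketch is an illustration only, under assumptions that are not in the original: that only significant results are published, that some proportion `prior_null` of the hypotheses tested are in fact (loosely) null, and that the remaining tests share a common `power`. The function name and the example figures are hypothetical.

```python
def false_rejection_share(alpha, power, prior_null):
    """Share of published significant results that are Type I errors.

    Assumes every significant result, and no other, reaches print.
    alpha      -- level of significance used by all investigators
    power      -- probability of significance when the null is false
    prior_null -- proportion of tested null hypotheses that are true
    """
    significant_and_null = alpha * prior_null
    significant_and_real = power * (1.0 - prior_null)
    return significant_and_null / (significant_and_null + significant_and_real)


if __name__ == "__main__":
    # Hypothetical scenarios: (power, proportion of true nulls).
    for power, prior_null in ((0.8, 0.5), (0.2, 0.5), (0.2, 0.8)):
        share = false_rejection_share(0.05, power, prior_null)
        print(power, prior_null, round(share, 3))
```

Under these assumed figures, low power and a high proportion of true nulls push the share of false rejections among published results well above the nominal 5%.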


The suspicion that the Type I error . . . literature is given confirmation
. . . the Journal of Abnormal and Social Psychology . . . (Cohen, 1962).
Analyzing 70 studies in which significant results were obtained
with . . . when the null hypothesis . . . Theoretically, with such . . .
strongly points to the . . . published is associated . . . practices
themselves are . . . which we base our conclusions concerning the nature
of psychological phenomena. Our total research enterprise is, at least in
part, a kind of scientific roulette, in which the “lucky,” or constant
player, “wins,”
that is, gets his paper or papers published. And certainly, going from 
5% to 1% does not eliminate the possibility that it is “the way the 
ball bounces,” to use Melton’s phrase. It changes the odds in this 
roulette, but it does not make it less a game of roulette. 

The damage to the scientific enterprise is compounded by the fact 
that the publication of “significant” results tends to stop further 
investigation. If the publication of papers containing Type I errors 
tended to foster further investigation so that the psychological 
phenomena with which we are concerned would be further probed by
others, it would not be too bad. But it does not. Quite the contrary.
As Lindquist (1940, p. 17) has correctly pointed out, the danger to
science of the Type I error is much more serious than the Type II
error—for when a Type I error is committed, it has the effect of stopping
investigation. A highly significant result appears definitive, as Melton’s
comments indicate. In the 12 years that he edited the Journal of
Experimental Psychology, he sought to select papers which were worthy
of being placed in the “archives,” as he put it. Even the strict repetition
of an experiment and not getting significance in the same way does not
speak against the result already reported in the literature. For failing
to get significance, speaking strictly within the inference model, only
means that that experiment is inconclusive; whereas the study
already reported in the literature, with a low p value, is regarded as
conclusive. Thus we tend to place in the archives studies with a
relatively high number of Type I errors, or, at any rate, studies which 
reflect small deviations from null in the respective populations; 
and we act in such a fashion as to reduce the likelihood of their 
correction. 


Psychologist’s “adjustment”
by misinterpretation
The psychological literature is filled with misinterpretations of the 
nature of the test of significance. One may be tempted to attribute 
this to such things as lack of proper education, the simple fact that
humans may err, and the prevailing tendency to take a cookbook
approach in which the mathematical and philosophical framework
out of which the tests of significance emerge is ignored; that, in other
words, these misinterpretations are somehow the result of simple
intellectual inadequacy on the part of psychologists. However, such 
an explanation is hardly tenable. Graduate schools are adamant
with respect to statistical education. Any number of psychologists
have taken out substantial amounts of time to equip themselves
mathematically and philosophically. Psychologists as a group do a
great deal of mutual criticism. Editorial reviews prior to publication
are carried out with eminent conscientiousness. There is even a
substantial literature devoted to various kinds of “misuse” of statistical
procedures, to which not a little attention has been paid.

It is rather that the test of significance is profoundly interwoven
with other strands of the psychological research enterprise in such a
way that it constitutes a critical part of the total cultural-scientific
tapestry. To pull out the strand of the test of significance would seem
to make the whole tapestry fall apart. In the face of the intrinsic
difficulties that the test of significance provides, we rather attempt to
make an “adjustment” by attributing to the test of significance
characteristics which it does not have, and overlook characteristics that
it does have. The difficulty is that the test of significance can, especially
when not considered too carefully, do some work; for, after all, the
results of the test of significance are related to the phenomena in which
we are interested. One may well ask whether we do not have here,
perhaps, an instance of the phenomenon that learning under partial
reinforcement is very highly resistant to extinction. Some of these
misinterpretations are as follows:


1. Taking the p value as 
a “measure” of 
significance 


A common misinterpretation of the test of significance is to regard it
as a “measure” of significance. It is interpreted as the answer to the
question “How significant is it?” A p value of .05 is thought of as
less significant than a p value of .01, and so on. The characteristic
practice on the part of psychologists is to compute, say, a t, and then
“look up” the significance in the table, taking the p value as a function of
t, and thereby a “measure” of significance. Indeed, since the p value is
inversely related to the magnitude of, say, the difference between
means in the sample, it can function as a kind of “standard score”
measure for a variety of different experiments. Mathematically, the
t is actually very similar to a “standard score,” entailing a deviation
in the numerator, and a function of the variation in the denominator;
and the p value is a “function” of t. If this use were explicit, it would
perhaps not be too bad. But it must be remembered that this is using
the p value as a statistic descriptive of the sample alone, and does not
automatically give an inference to the population. There is even the
practice of using tests of significance in studies of total populations, in
which the observations cannot by any stretch of the imagination be
thought of as having been randomly selected from any designable
population.* Using the p value in this way, in which the statistical
inference model is even hinted at, is completely indefensible; for the
single function of the statistical inference model is making inferences to
populations from samples.
The practice of “looking up” the p value for the t, which has even
been advocated in some of our statistical handbooks (e.g., Lacey,
1953, p. 117; Underwood, Duncan, Taylor, and Cotton, 1954, p. 129),
rather than looking up the t for a given p value, violates the inference
model. The inference model is based on the presumption that one
initially adopts a level of significance as the specification of that
probability which is too low to occur to “us,” as Fisher has put it, in
this one instance, and under the null hypothesis. A purist might speak
of the “delicate problem . . . of fudging with a posteriori alpha values
[levels of significance, Kaiser, 1960, p. 165],” as though the levels of
significance were initially decided upon, but rarely do psychological
research workers or editors take the level of significance as other than
a “measure.”
But taken as a “measure,” it is only a measure of the sample.
Psychologists often erroneously believe that the p value is “the
probability that the results are due to chance,” as Wilson (1961, p. 230)
has pointed out; that a p value of .05 means that the chances are .95
that the scientific hypothesis is correct, as Bolles (1962) has pointed out;
that it is a measure of the power to “predict” the behavior of a
population (Underwood et al., 1954, p. 107); and that it is a measure of the
“confidence that the results of the experiment would be repeatable
under the conditions described,” as Melton put it. Unfortunately,
none of these interpretations are within the inference model of the test
of significance. Some of our statistical handbooks have “allowed”
misinterpretation. For example, in discussing the erroneous rhetoric
associated with talking of the “probability” of a population parameter
(in the inference model there is no probability associated with something
which is either true or false), Lindquist (1940) said, “For most practical
purposes, the end result is the same as if the ‘level of confidence’ type
of interpretation is employed [p. 14].” Ferguson (1959) wrote, “The
.05 and .01 probability levels are descriptive of our degree of confidence
. . .” There is little question but that sizable differences, correlations,
etc., in samples, especially samples of reasonable size, speak more
strongly of sizable differences, correlations, etc., in the population;
and there is little question but that if there is a real and strong effect
in the population, it will continue to manifest itself in further sampling.
However, these are inferences which we may make. They are outside
the inference model associated with the test of significance. The p value
within the inference model is only the value which we take to mark
how improbable an event could be under the null hypothesis, which
we judge will not take place to “us,” in this one experiment.

4. It was decided not to cite any specific studies to exemplify points such
as this one. The reader will undoubtedly be able to supply them for himself.


. . . psychologists of the meaning of the test of significance. The subjects
were 9 members of . . . doctoral degrees, and . . . Dakota; and there is
. . . psychologists was more or . . . asked to rate their degree . . .
studies for a variety . . . there should be a relationship . . . can be
more confident . . . how could a group . . . wrongness is based on the
commonly held belief that the p value is a “measure” of degree of
confidence. Thus, the reasoning behind such a wrong set of answers by
these psychologists may well have been something like this: the p value
is a measure of confidence; but a larger number of cases also increases
confidence; therefore, for any given p value, the degree of confidence
should be higher for the larger n. The wrong conclusion arises from the
erroneous character of the first premise, and from the failure to
recognize that the p value is a function of sample size for any given
deviation from null in the population. The author knows of instances
in which editors of very reputable psychological journals have rejected
papers in which the
p values and n’s were small on the grounds that there were not enough 
observations, clearly demonstrating that the same mode of thought 
is operating in them. Indeed, rejecting the null hypothesis with a 
small n is indicative of a strong deviation from null in the population, 
the mathematics of the test of significance having already taken into 
account the smallness of the sample. Increasing the n increases the 
probability of rejecting the null hypothesis; and in these studies
rejected for small sample size, that task has already been accomplished. 
These editors are, of course, in some sense the ultimate “teachers” 
of the profession; and they have been teaching something which is 
patently wrong! 
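
The relation between p value, sample size, and the size of the deviation can be worked through numerically. The following is a rough sketch rather than the author's own calculation: it assumes a two-tailed one-sample z test on a standardized deviation (deviation divided by a known standard deviation), and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_for_deviation(standardized_deviation, n):
    """Two-tailed p value for an observed standardized deviation and n."""
    z = abs(standardized_deviation) * n ** 0.5
    return 2 * (1 - _STD_NORMAL.cdf(z))


def deviation_for_p(p, n):
    """Observed standardized deviation that yields exactly this p at this n."""
    return _STD_NORMAL.inv_cdf(1 - p / 2) / n ** 0.5


if __name__ == "__main__":
    # Same p value, the two sample sizes used in the study discussed above.
    for n in (10, 100):
        print(n, round(deviation_for_p(0.05, n), 3))
    # Same observed deviation, growing n: the p value shrinks.
    for n in (10, 100, 1000):
        print(n, p_for_deviation(0.2, n))
```

For one and the same p value, the deviation that must have been observed with n = 10 is larger than with n = 100 by a factor of the square root of 10, which is why a significant result from the smaller sample speaks for the more dramatic effect.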


2. Automaticity of 
inference 


What may be considered to be a dream, fantasy, or ideal in the culture 
of psychology is that of achieving complete automaticity of inference.
The making of inductive generalizations is always somewhat risky.
In Fisher's The Design of Experiments (1947, p. 4), he made the claim
that the methods of induction could be made rigorous, exemplified by
the procedures which he was setting forth. This is indeed quite correct
in the sense indicated earlier. In a later paper, he made explicit what
was strongly hinted at in his earlier writing, that the methods which he
proposed constituted a relatively complete specification of the process
of induction: 


That such a process of induction existed and was possible to normal minds,
has been understood for centuries; it is only with the recent development
of statistical science that an analytic account can now be given, about as
satisfying and complete, at least, as that given traditionally of the
deductive processes [Fisher, 1955, p. 74].


Psychologists certainly took the procedures associated with the t test,
F test, and so on, in this manner. Instead of having to engage in inference
themselves, they had but to “run the tests” for the purpose of making
inferences, since, as it appeared, the statistical tests were analytic
analogues of inductive inference. The “operationist” orientation
among psychologists, which recognized the contingency of knowledge
on the knowledge-getting operations and advocated their specification,
could, it would seem, “operationalize” the inferential processes simply
by reporting the details of the statistical analysis! It thus removed the
burden of responsibility, the chance of being wrong, the necessity for
making inductive inferences, from the shoulders of the investigator
and placed them on the tests of significance. The contingency of the
conclusion upon the experimenter’s decision of the level of significance
was managed in two ways. The first, by resting on a kind of social
agreement that 5% was good, and 1% better. The second in the manner
which has already been discussed, by not making a decision of the level
of significance, but only reporting the p value as a “result” and a
presumably objective “measure” of degree of confidence. But that the
probability of getting significance is also contingent upon the number
of observations has been handled largely by ignoring it.
A crisis was experienced among psychologists when the matter
of the one- versus the two-tailed test came into prominence; for here
the contingency of the result of a test of significance on a decision of the
investigator was simply too conspicuous to be ignored. An investigator,
say, was interested in the difference between two groups on some
measure. He collected his data, found that Mean A was greater than
Mean B in the sample, and ran the ordinary two-tailed t test; and,
let us say, it was not significant. Then he bethought himself. The
two-tailed test tested against two alternatives, that the Population
Mean A was greater than Population Mean B and vice versa. But then
he really wanted to know whether . . . Thus, he could run a one-tailed
test . . . the one-tailed test is more powerful . . . Now here there
was . . . nearly so automatic an . . . manifestly contingent on . . .
to run a one- . . . somehow overcome this particular contingency of
the results of a test of significance on the decision of the investigator.
The author will not attempt here to review this literature, except to cite
one very competent paper which points up the intrinsic difficulty
associated with this problem, the reductio ad absurdum to which one
comes. Kaiser (1960), early in his paper, distinguished between the logic
associated . . . other forms of inference . . . have allowed: “The
arguments developed in this paper are based on logical considerations
in statistical inference. (We do not, of course,
suggest that statistical inference is the only basis for scientific inference)
[p. 160].” But then, having taken the position that he is going to 
follow the logic of statistical inference relentlessly, he said (Kaiser’s 
italics): “we cannot logically make a directional statistical decision or 
statement when the null hypothesis is rejected on the basis of the direction 
of the difference in the observed sample means [p. 161].” One really 
needs to strike oneself in the head! If Sample Mean A is greater than 
Sample Mean B, and there is reason to reject the null hypothesis, 
in what other direction can it reasonably be? What kind of logic is 
it that leads one to believe that it could be otherwise than that Popula- 
tion Mean A is greater than Population Mean B? We do not know 
whether Kaiser intended his paper as a reductio ad absurdum, but it 
certainly turned out that way. 

The issue of the one- versus the two-tailed test genuinely challenges 
the presumptive “objectivity” characteristically attributed to the test 
of significance. On the one hand, it makes patent what was the case
under any circumstances (at the least in the choice of level of significance, 
and the choice of the number of cases in the sample), that the conclusion 
is contingent upon the decision of the investigator. An astute investi- 
gator, who foresaw the results, and who therefore pre-decided to use 
a one-tailed test, will get one p value. The less astute but honorable 
investigator, who did not foresee the results, would feel obliged to use 
a two-tailed test, and would get another p value. On the other hand, 
if one decides to be relentlessly logical within the logic of statistical 
inference, one winds up with the kind of absurdity which we have
cited above.
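
The contingency of the p value on the investigator's choice of tails is easy to exhibit. The sketch below is illustrative only: it assumes a z statistic in place of the t discussed in the text, and the observed value of 1.8 and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def one_tailed_p(z):
    """p value when only 'Mean A greater than Mean B' counts as evidence."""
    return 1 - _STD_NORMAL.cdf(z)


def two_tailed_p(z):
    """p value when a difference in either direction counts as evidence."""
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    z_observed = 1.8  # hypothetical statistic from one and the same data set
    print("one-tailed:", round(one_tailed_p(z_observed), 4))
    print("two-tailed:", round(two_tailed_p(z_observed), 4))
```

For a positive statistic the two-tailed p value is exactly twice the one-tailed value, so the same data can fall on either side of the 5% level depending on a decision made by the investigator.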


3. The confusion of 
induction to the aggregate 
with induction to the
general 


Consider a not atypical investigation of the following sort: A group of, 
say, 20 normals and a group of, say, 20 schizophrenics are given a test.
The tests are scored, and a t test is run, and it is found that the means
differ significantly at some level of significance, say 1%. What inference
can be drawn? As we have already indicated, the investigator could
have insured this result by choosing a sufficiently large number of cases.
Suppose we overlook this objection, which we can to some extent, by
saying that the difference between the means in the population must
have been large enough to have manifested itself with only 40 cases.
But still, what do we know from this? The only inference which this
allows is that the mean of all normals is different from the mean of all
schizophrenics in the populations from which the samples have
presumably been drawn at random. (Rarely is the criterion of
randomness satisfied. But let us overlook this objection too.)

The common rhetoric in which such results are discussed is in the
form “Schizophrenics differ from normals in such and such ways.”
The sense that both the reader and the writer have of this rhetoric is
that it has been justified by the finding of significance. Yet clearly it
does not mean all schizophrenics and all normals. All that the test
of significance justifies is that measures of central tendency of the
aggregates differ in the populations. The test of significance has not
addressed itself to anything about the schizophrenia or normality
which characterizes each member . . . it is certainly possible for an
investigator to develop a hypothesis about the nature of schizophrenia
from which he may infer that there should be differences between . . .
of a significant difference in the means of his sample would add to
the . . . processes.

Or consider another . . . experimenter divides 40 subjects at random
into two groups of 20 subjects each. One group is assigned to one
condition and the other to . . . conditions,” feeling that he . . . because
of his test of significance . . . allows him to make his statement for the
population, but only for that
learning task, and the p value is appropriate only to that. But the 
generalization to “massed conditions” and “distributed conditions” 
beyond that particular learning task is a second inference with respect 
to which the p value is not relevant. The psychological literature is 
plagued with any number of instances in which the rhetoric indicates
that the p value does bear on this second inference. 

Part of the blame for this confusion can be ascribed to Fisher 
who, in The Design of Experiments (1947, p. 9), suggested that the 
mathematical methods which he proposed were exhaustive of scientific 
induction, and that the principles he was advancing were “common to 
all experimentation.” What he failed to see and to say was that after 
an inference was made concerning a population parameter, one still 
needs to engage in induction to obtain meaningful scientific
propositions.

To regard the methods of statistical inference as exhaustive of the
inductive inferences called for in experimentation is completely
confounding. When the test of significance has been run, the necessity
for induction has hardly been completely satisfied. However, the
research worker knows this, in some sense, and proceeds, as he should,
to make further inductive inferences. He is, however, still ensnarled in
his test of significance and the presumption that it is the whole of his
inductive activity, and thus mistakenly takes a low p value for the
measure of the validity of his other inductions.

The seriousness of this confusion may be seen by again referring 
back to the Rosenthal and Gaito (1963) study and the remark by 
Berkson which indicate that research workers believe that a large
sample is better than a small sample. We need to refine the rhetoric
somewhat. Induction consists in making inferences from the particular
to the general. It is certainly the case that as confirming particulars
are added, the credibility of the general is increased. However, the
addition of observations to a sample is, in the context of statistical
inference, not the addition of particulars but the modification of what is
one particular in the inference model, the sample aggregate. In the
context of statistical inference, it is not necessarily true that “a large
sample is better than a small sample.” For, as has been already
indicated, obtaining a significant result with a small sample suggests
a larger deviation from null in the population, and may be considerably
more meaningful. Thus more particulars are better than fewer
particulars in the making of an inductive inference; but not necessarily
a larger sample.

In the marriage of psychological research and statistical inference, 
psychology brought its own reasons for accepting this confusion,
. . . laws which characterize the “generalized, normal, human, adult mind
[Boring, 1950, p. 413].” The research strategy associated with this
kind of psychology is straightforwardly inductive. It seeks inductive
generalizations which will apply to every member of a designated class.
A single particular in which a generalization fails forces a rejection of the
generalization, calling for either a redefinition of the class to which it
applies or a modification of the generalization. The other tradition is the
psychology of individual differences, which has its roots more in England
and the United States than on the continent. We may recall that when the
young American, James McKeen Cattell, who invented the term mental
test, came to Wundt with his own problem of individual differences,
it was regarded by Wundt as ganz Amerikanisch (Boring, 1950, p. 324).


. . . tradition, it is the aggregate which is of interest, and not the
general. . . . general proposition in which . . .

The distinction between . . . illuminated by a small . . . of variance
developed . . . of choice among psychologists . . . analysis of
variance . . . subjects may have in . . . scores. This is all that . . .
following identity illustrates . . . total sum squares, of . . . is simply
the partitioning of, is based on the literal difference between each pair
of scores (cf. Bakan, 1955). Except for n, it is the only information . . .

\[
\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (X_i - X_j)^2
\]
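
The identity referred to above can be checked numerically. The sketch below assumes the standard form of the identity, in which the total sum of squares about the mean equals the sum of squared differences over all distinct pairs of scores divided by n; whether this matches the printed formula term for term is an assumption, and the function names and sample scores are hypothetical.

```python
from itertools import combinations


def sum_sq_about_mean(scores):
    """Total sum of squares computed in the usual way, about the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)


def sum_sq_from_pairs(scores):
    """The same quantity computed only from differences between pairs."""
    n = len(scores)
    return sum((a - b) ** 2 for a, b in combinations(scores, 2)) / n


if __name__ == "__main__":
    scores = [2, 4, 4, 5, 7, 9]  # hypothetical data
    print(sum_sq_about_mean(scores), sum_sq_from_pairs(scores))
```

The second function never looks at any score by itself, only at differences between scores, which is the sense in which analysis of variance deals with aggregates rather than with what subjects have in common.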


Thus, what took place historically in psychology is that instead of
attempting to synthesize the two traditional approaches to psychological
phenomena, which is both possible and desirable, a syncretic
combination took place of the methods appropriate to the study of
aggregates with the aims of a psychology which sought for general
propositions. One of the most overworked terms, which added not
a little to the essential confusion, was the term “error,” which was a
kind of umbrella term for (at the least) variation among scores from
different individuals, variation among measurements for the same
individual, and variation among samples.
Let us add another historical note. In 1936, Guilford published
his well-known Psychometric Methods. In this book, which became
a kind of “bible” for many psychologists, he made a noble effort at a
“Rapprochement of Psychophysical and Test Methods” (p. 9). He
suggested, quite properly, that mathematical developments in each of
the two fields might be of value in the other, that “Both psychophysics
and mental testing have rested upon the same fundamental statistical
devices [p. 9].” There is no question of the truth of this. However,
what he failed to emphasize sufficiently was that mathematics is so
abstract that the same mathematics is applicable to rather different
fields of investigation without there being any necessary further
identity between them. (One would not, for example, argue that
business and genetics are essentially the same because the same
arithmetic is applicable to market research and in the investigation of
the facts of heredity.) A critical point of contact between the two
traditions was in connection with scaling in which Cattell’s principle
that “equally often noticed differences are equal unless always or never
noticed [Guilford, 1936, p. 217]” was adopted as a fundamental
assumption. The “equally often noticed differences” is, of course,
based on aggregates. By means of this assumption, one could collapse
the distinction between the two areas of investigation. Indeed, this is
not really too bad if one is alert to the fact that it is an assumption, one
which even has considerable pragmatic value. As a set of techniques
whereby data could be analyzed, that is, as a set of techniques whereby 
one could describe one’s findings, and then make inductions about the 
nature of the psychological phenomena, that which Guilford put 
together in his book was eminently valuable. However, around this 
time the work of Fisher and his school was coming to the attention of 
psychologists. It was attractive for several reasons. It offered advice 
for handling “small samples.” It offered a number of eminently 
ingenious new ways of organizing and extracting information from 
data. It offered ways by which several variables could be analyzed 
simultaneously, away from the old notion that one had to keep every- 
thing constant and vary only one variable at a time. It showed how 
the effect of the “interaction” of variables could be assessed. But it 
also claimed to have mathematized induction! The Fisher approach 
was thus “bought,” and psychologists got a theory of induction in the 
bargain, a theory which seemed to exhaust the inductive processes. 
Whereas the question of the “reliability” of statistics had been a matter 
of concern for some time before (although frequently very garbled), it 
had not carried the burden of induction to the degree that it did with 
the Fisher approach. With the “buying” of the Fisher approach the 
psychological research worker also bought, and then overused, the


test of significance, employing it as the measure of the significance, in 
the largest sense of the word, of his research efforts. 


Sharp and loose 
null hypotheses 


Earlier, a distinction was made between sharp and loose null hypoth- 
eses. One of the major difficulties associated with the Fisher approach 
is the problem presented by sharp null hypotheses; for, as we have 
already seen, there is reason to believe that the existence of sharp null 
hypotheses is characteristically unlikely. There have been some efforts 
to correct for this difficulty by proposing the use of loose null hypoth- 
eses; in place of a single point, a region being considered null. 
Hodges and Lehmann (1954) have proposed a distinction between 
“statistical significance,” which entails the sharp hypothesis, and 
“material significance,” in which one tests the hypothesis of a deviation 
of a stated amount from the null point instead of the null point itself. 
Edwards (1950, pp. 30-31) has suggested the notion of “practical 
significance” in which one takes into account the meaning, in some 
practical sense, of the magnitude of the deviation from null together 
with the number of observations which have been involved in getting 
statistical significance. Binder (1963) has equally argued that a subset 
of parameters be equated with the null hypothesis. Essentially what 
has been suggested is that the investigator make some kind of a decision 
concerning “How much, say, of a difference makes a difference?” 
The difficulty with this solution, which is certainly a sound one 
technically, is that in psychological research we do not often have very 
good grounds for answering this question. This is partly due to the 
inadequacies of psychological measurement, but mostly due to the 
fact that the answer to the question of “How much of a difference 
makes a difference?” is not forthcoming outside of some particular 
practical context. The question calls forth another question, “How 
much of a difference makes a difference for what?” 
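
The contrast between a sharp and a loose null hypothesis can be sketched numerically. The following is only an illustration under stated assumptions: a one-sample z test with a known standard deviation of 1, an observed difference expressed in standard deviation units, and a margin chosen arbitrarily. The function names and the numbers are hypothetical and do not come from Hodges and Lehmann, Edwards, or Binder.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def p_sharp_null(observed_diff, n):
    """Two-sided p-value for the sharp null hypothesis: true difference = 0."""
    z = abs(observed_diff) * sqrt(n)
    return 2 * (1 - STD_NORMAL.cdf(z))


def p_loose_null(observed_diff, n, margin):
    """Approximate p-value for the loose null: |true difference| <= margin.

    Evaluated at the edge of the null region nearest the observed value;
    the negligible far tail is ignored (an assumption of this sketch).
    """
    z = (abs(observed_diff) - margin) * sqrt(n)
    return 1 - STD_NORMAL.cdf(z)


# Hypothetical numbers: a tiny observed difference and a very large sample.
tiny_diff, big_n = 0.02, 100_000
print(p_sharp_null(tiny_diff, big_n))        # far below .05
print(p_loose_null(tiny_diff, big_n, 0.10))  # close to 1
```

With a large enough sample the sharp null is rejected for a difference that nobody would call material, while the loose null is not. The sketch leaves open exactly the question raised in the text: the margin of 0.10 had to be supplied from outside the statistics.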


Decisions vs. assertions 


This brings us to one of the major issues within the field of statistics 
itself. The problems of the research psychologist do not generally 
lie within practical contexts. He is rather interested in making asser- 
tions concerning psychological functions which have a reasonable 
amount of credibility associated with them. He is more concerned 
with “What is the case?” than with “What is wise to do?” (cf. Rozeboom, 
1960). 
It is here that the decision-theory approach of Neyman, Pearson, 
and Wald (Neyman, 1937, 1957; Neyman and Pearson, 1933; Wald, 
1939, 1950, 1955) becomes relevant. The decision-theory school, still 
basing itself on some basic notions of the Fisher approach, deviated 
from it in several respects: 
1. In Fisher’s inference model, the two alternatives between which 
one chose on the basis of an experiment were reject and inconclusive. 
As he said in The Design of Experiments (1947), “the null hypothesis is 
never proved or established, but is possibly disproved, in the course 
of experimentation [p. 16].” In the decision-theory approach, the 
two alternatives are rather reject and accept. 
2. Whereas in the Fisher approach the interpretation of the test of 
significance critically depends on having one sample from a hypo- 
thetical population of experiments, the decision-theory approach 
conceives of, is applicable to, and is sensible with respect to numerous 
repetitions of the experiment. 
3. The decision-theory approach added the notions of the Type II 
error (which can be made only if the null hypothesis is accepted) and 
power as significant features of their model. 


4. The decision-theory model gave a significant place to the matter of 
what is concretely lost if an error is made in the practical context, on 
the presumption that accept entailed one concrete action, and reject 
another. It is in these actions and their consequences that there is a 
basis for deciding on a level of confidence. The Fisher approach has 
little to say about the consequences. 
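
Points 3 and 4 can be made concrete with a small sketch of an accept/reject rule for inspecting lots of manufactured items. Everything specific here is an assumption made for illustration: the plan (sample 50 items, accept the lot if at most 2 are defective), the "good" and "bad" defect rates, and the two loss figures are hypothetical and are not taken from Neyman, Pearson, Wald, or Barnard.

```python
from math import comb


def accept_prob(n, c, p):
    """P(at most c defectives in a sample of n) when the lot defect rate is p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))


# Hypothetical plan and lot qualities (assumptions of this sketch).
n, c = 50, 2
p_good, p_bad = 0.01, 0.10

type_i = 1 - accept_prob(n, c, p_good)  # reject a lot that is in fact good
type_ii = accept_prob(n, c, p_bad)      # accept a lot that is in fact bad
power = 1 - type_ii                     # chance of rejecting the bad lot

# Hypothetical concrete losses attached to each kind of error.
loss_reject_good, loss_accept_bad = 100.0, 1000.0
expected_loss_if_good = type_i * loss_reject_good
expected_loss_if_bad = type_ii * loss_accept_bad

print(type_i, type_ii, power)
print(expected_loss_if_good, expected_loss_if_bad)
```

The point of the sketch is that the acceptance number c, and with it the level of confidence, can be tuned against the two expected losses only because each action has a concrete consequence with a price attached.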


As it has turned out, the field of application par excellence for the 
decision-theory approach has been the sampling inspection of mass- 
produced items. In sampling inspection, the acceptable deviation from 
null can be specified; both accept and reject are appropriate categories; 
the alternative courses of action can be clearly specified; there is a 
definite measure of loss for each possible action; and the choice may 
be regarded as one of a series of such choices, so that one can minimize 
the overall loss (cf. Barnard, 1954). Where the aim is only the acquisition 
of knowledge without regard to a specific practical context, these 
conditions do not often prevail. Many psychologists who learned 
about analysis of variance from books such as those by Snedecor 
(1946) found the examples involving hog weights, etc. somewhat 
annoying. The decision-theory school makes it clear that such 
practical contexts are [...] purposes, but actually a [...] 
The contributions [...] revealed the intrinsic nature [...] 
by Fisher and his colleagues [...] associated with the test [...] 
or an induction, or a [...] evaluation calculus. Fisher ([...] 
[...]ological performance, [...] effort of a five-year plan for the [...] 
American “ideological” orientation: “In t[...] [...]ance of organized 
technology has I think [...] process appropriate for drawing [...] 
conclusions, with those aimed rather at, let us say, speeding production 
or saving money [p. 70].”⁵ But perhaps a more reasonable [...] 
decision-theory school to have explica[...] the work of the Fisher school. 


5. For a reply to Fisher, see Pearson (1955). 


Conclusion 


What then is our alternative, if the test of significance is really of such 
limited appropriateness as has been indicated? At the very least it 
would appear that we would be much better off if we were to attempt to 
estimate the magnitude of the parameters in the populations; and 
recognize that we then need to make other inferences concerning the 
psychological phenomena which may be manifesting themselves in 
these magnitudes. In terms of a statistical approach which is an 
alternative, the various methods associated with the theorem of Bayes 
which was referred to earlier may be appropriate; and the paper by 
Edwards et al. (1963) and the book by Schlaifer (1959) are good starting 
points. However, that which is expressed in the theorem of Bayes 
alludes to the more general process of inducing propositions concerning 
the nonmanifest (which is what the population is a special instance of) 
and ascertaining the way in which that which is manifest (which the 
sample is a special instance of) bears on it. This is what the scientific 
method has been about for centuries. However, if the reader who might 
be sympathetic to the considerations set forth in this paper quickly 
goes out and reads some of the material on the Bayesian approach with 
the hope that thereby he will find a new basis for automatic inference, 
this paper will have misfired, and he will be disappointed. 
That which we have indicated in this paper in connection with the 
test of significance in psychological research may be taken as an instance 
of a kind of essential mindlessness in the conduct of research which may 
be, as the author has suggested elsewhere (Bakan, 1965), related to the 
presumption of the nonexistence of mind in the subjects of psycho- 
logical research. Karl Pearson once indicated that higher statistics 
were only common sense reduced to numerical appreciation. However, 
that base in common sense must be maintained with vigilance. When 
we reach a point where our statistical procedures are substitutes 
instead of aids to thought, and we are led to absurdities, then we must 
return to the common sense basis. Tukey (1962) has very properly 
pointed out that statistical procedures may take our attention away 
from the data, which constitute the ultimate base for any inferences 
which we might make. Robert Schlaifer (1959, p. 654) has dubbed 
the error of the misapplication of statistical procedures the “error of 
the third kind,” the most serious error which can be made. Berkson 
has suggested the use of “the interocular traumatic test, you know what 
the data mean when the conclusion hits you between the eyes [Edwards 
et al., 1963, p. 217].” We must overcome the myth that if our treatment 
of our subject matter is mathematical it is therefore precise and valid. 
Mathematics can serve to obscure as well as reveal. 
Most importantly, we need to get on with the business of generating 
psychological hypotheses and proceed to do investigations and make 
inferences which bear on them; instead of, as so much of our literature 
would attest, testing the statistical null hypothesis in any number of 
contexts in which we have every reason to suppose that it is false in the 
first place. 


statistical significance in 
psychological research 


David T. Lykken 


In a recent journal article Sapolsky (1964) developed the following 
substantive theory: Some psychiatric patients entertain an unconscious 
belief in the “cloacal theory of birth” which involves the notions of 
oral impregnation and anal parturition. Such patients should be 
inclined to manifest eating disorders: compulsive eating in the case 
of those who wish to get pregnant and anorexia in those who do not. 
Such patients should also be inclined to see cloacal animals, such as 
frogs, on the Rorschach. This reasoning led Sapolsky to predict that 
Rorschach frog responders show a higher incidence of eating disorders 
than patients not giving frog responses. A test of this hypothesis in a 
psychiatric hospital showed that 19 of 31 frog responders had eating 
disorders indicated in their charts, compared to only 5 of the 31 
control patients. A highly significant chi-square was obtained. 


From Psychological Bulletin, Vol. 70 (No. 3), 1968, pp. 151-159. Copyright 1968 
by the American Psychological Association. Reproduced by permission. 


It will be an expository convenience to analyze Sapolsky’s article 
in considerable detail for purposes of illustrating the methodological 
issues which are the real subject of this paper. My intent is not to 
criticize a particular author but rather to examine a kind of epistemic 
confusion which seems to be endemic in psychology, especially, but by 
no means exclusively, in its “softer” precincts. One would like to 
demonstrate this generality with multiple examples. Having just 
combed the latest issues of four well-known journals in the clinical 
and personality areas, I could undertake to identify several papers in 
each issue wherein, because they were able to reject a directional null 
hypothesis at some high level of significance, the authors claimed to have 
usefully corroborated some rather general theory or to have demon- 
strated some important empirical relationship. To substantiate that 
these claims are overstated and that much of this research has not yet 
earned the right to the reader’s overburdened attentions would require 
a lengthy analysis of each paper. Such profligacy of space would ill 
become an essay one aim of which is to restrain the swelling volume of 
the psychological literature. Therefore, with apologies to Sapolsky for 
subjecting this one paper to such heavy handed scrutiny, let us proceed 
with the analysis. 

Since I regarded the prior probability of Sapolsky’s theory (that 
frog responders unconsciously believe in impregnation per os) to 
be nugatory and its likelihood unenhanced by the experimental findings, 
I undertook to check my own reaction against that of 20 colleagues, 
[...] to .13 with a median value of .01, which can be [...] 
roughly, “I don’t believe it.” Since the prior [...] many important 
scientific theories is considered to be vanishingly small when they are 
first [...] which “corroborate” the theory by confirming the operational 
hypothesis derived from it with high statistical significance, these same 
psychologists [...] posterior probabilities to the theory which [...] the 
median unchanged at .01. I interpret this consensus to mean, roughly: 
“I still don’t believe it.” This finding, I submit, is alarming because it 
signifies a sharp difference of opinion between, for example, the 
consulting editors of the journal and a substantial segment of its 
readership, a difference on the very fundamental question of what 
constitutes good (i.e., publishable) clinical research. 
The thesis of the present paper is that Sapolsky and the editors 
were in fact following, with reasonable consistency, our traditional 
[...] psychological research, but that, as the Sapolsky [...] at least 
two of these rules should be reconsidered. 
[...] rules examined here asserts roughly the following: “[...] 
prediction or hypothesis derived from a theory is confirmed by 
experiment, a nontrivial increment in one’s confidence in that theory 
should result, especially when one’s prior confidence is low.” Clearly, 
my 20 colleagues were violating this rule here since their confidence 
in the frog responder-cloacal birth theory was not, on the average, 
increased by the contemplation of Sapolsky’s highly significant chi- 
square. From their comments it seems that they found it too hard to 
accept that a belief in oral impregnation could lead to frog responding 
merely because the frog has a cloacus. (One must, after all, admit that 
few patients know what a cloacus is or that a frog has one and that those 
few who do know probably will also know that the frog’s eggs are both 
fertilized and hatched externally so neither oral impregnation nor anal 
birth are in any way involved. Hence, neither the average patient nor 
the biologically sophisticated patient should logically be expected to 
employ the frog as a symbol for an unconscious belief in oral con- 
ception.) My colleagues, on the contrary, found it relatively easy to 
believe that the observed association between frog responding and 
eating problems might be due to some other cause entirely (e.g., both 
symptoms are immature or regressive in character; the frog, with its 
disproportionately large mouth and voice may well constitute a 
common orality totem and hence be associated with problems in the 
oral sphere; “squeamish” people might tend both to see frogs and to 
have eating problems; and so on.) 
Assuming that this first rule is wrong in this instance, perhaps it 
could be amended to allow one to make exceptions in cases resembling 
this illustration. For example, one could add the codicil: “This rule 
may be ignored whenever one considers the theory in question to be 
overly improbable or whenever one can think of alternative explana- 
tions for the experimental results.” But surely such an amendment 
would not do. ESP, for example, could never become scientifically 
respectable if the first exception were allowed, and one consequence of 
the second would be that the importance attached to one’s findings 
would always be inversely related to the ingenuity of one’s readers. 
The burden of the present argument is that this rule is wrong not only 
in a few exceptional instances but as it is routinely applied to the majority 
of experimental reports in the psychological literature. 


Corroborating theories by 
experimental confirmation 
of theoretical predictions¹ 


1. Much of the argument in this section is based upon ideas developed in 
certain unpublished memoranda by P. E. Meehl (personal communication, 
1963) and in a recent article (Meehl, 1967). 


Most psychological experiments are of three kinds: (a) studies of the 
effect of some treatment on some output variables, which can be 
regarded as a special case of (b) studies of the difference between two or 
more groups of individuals with respect to some variable, which in 
turn are a special case of (c) the study of the relationship or correlation 
between two or more variables within some specified population. 
Using the bivariate correlation design as paradigmatic, then, one notes 
first that the strict null hypothesis must always be assumed to be false 
(this idea is not new and has recently been illuminated by Bakan, 1966). 
Unless one of the variables is wholly unreliable so that the values 
obtained are strictly random, it would be foolish to suppose that the 
correlation between any two variables is identically equal to .0000 . . . 
(or that the effect of some treatment or the difference between two 
groups is exactly zero). The molar dependent variables employed in 
psychological research are extremely complicated in the sense that 
the measured value of such a variable tends to be affected by the 
interaction of a vast number of factors, both in the present situation and 
in the history of the subject organism. It is exceedingly unlikely that 
any two such variables will not share at least some of these factors 
and equally unlikely that their effects will exactly cancel one another 
out. 
It might be argued that the more complex the variables the smaller 
their average correlation ought to be since a larger pool of common 
factors allows more chance for mutual cancellation of effects in obedi- 
ence to the Law of Large Numbers. However, one knows of a number 
of [...] potent and pervasive factors which operate to unbalance 
[...] and to produce correlations large enough 
to [...] causal factors the experimenter may [...] 
know that (a) “good” psychological and physical variables tend to be 
positively correlated; (b) experimenters, without deliberate intention, 
can somehow subtly bias their findings in the expected direction 
(Rosenthal, 1963); (c) the effects of common method are often as strong 
as or stronger than those produced by the actual variables of interest 
(e.g., in a large and careful study of the factorial structure of adjustment 
to stress among officer candidates, Holtzman and Bitterman, 1956, 
found that their 101 original variables [...] common factors 
representing, respectively, their [...] their perceptual-motor tests, the 
McKinney Reporting Test, [...]); (d) [...] such as the subject’s anxiety 
level, fatigue, or his desire to please, may broadly affect all measures 
obtained in a single experimental session. [...] variance of “unrelated” 
variables can be [...] kind of ambient noise level characteristic of the 
domain. It would be interesting to obtain empirical estimates of this quantity 
in our field to serve as a kind of Plimsoll mark against which to compare 
obtained relationships predicted by some theory under test. If, as I 
think, it is not unreasonable to suppose that “unrelated” molar 
psychological variables share on the average about 4% to 5% of 
common variance, then the expected correlation between any such 
variables would be about .20 in absolute value and the expected 
difference between any two groups on some such variable would be 
nearly .5 standard deviation units. (Note that these estimates assume 
zero measurement error. One can better explain the near-zero 
correlations often observed in psychological research in terms of 
unreliability of measures than in terms of the assumption that the true 
scores are in fact unrelated.) 
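
The arithmetic behind these two figures can be sketched as follows. The conversion from a correlation to a standardized group difference used here is the usual point-biserial relation for two groups of equal size, which is an assumption of this sketch and is not stated in the text; the function names are hypothetical.

```python
from math import sqrt


def r_from_shared_variance(shared):
    """Correlation implied by a given proportion of common variance."""
    return sqrt(shared)


def d_from_r(r):
    """Standardized mean difference implied by a point-biserial r,
    assuming two groups of equal size (an assumption of this sketch)."""
    return 2 * r / sqrt(1 - r * r)


for shared in (0.04, 0.05):
    r = r_from_shared_variance(shared)
    print(shared, round(r, 2), round(d_from_r(r), 2))
```

Under these assumptions, 4% to 5% of common variance corresponds to a correlation of a little over .20 and a group difference somewhat below half a standard deviation, in line with the rough figures given above.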
Suppose now that an investigator predicts that two variables are 
positively correlated. Since we expect the null hypothesis to be false, 
we expect his prediction to be confirmed by experiment with a proba- 
bility of very nearly .5; by using a large enough sample, moreover, he 
can achieve any desired level of statistical significance for this result. 
If the ambient noise level for his domain is represented by correlations 
averaging, say, .20 in absolute value, then his chances of finding a 
statistically significant confirmation of his prediction with a reasonable 
sample size will be quite high (e.g., about 1 in 4 for N = 100) even if 
there is no truth whatever to the theory on which the prediction was 
based. Since most theoretical predictions in psychology, especially in 
the areas of clinical and personality research, specify no more than the 
direction of a correlation, difference or treatment effect, we must 
accept the harsh conclusion that a single experimental finding of this 
usual kind (confirming a directional prediction), no matter how great 
its statistical significance, will seldom represent a large enough 
increment of corroboration for the theory from which it was derived 
to merit very serious scientific attention. (In the natural sciences, this 
problem is far less severe for two reasons: (a) [...] 
enough to generate point predictions or at least predictions of some 
narrow range within which the dependent variable is expected to lie; 
and (b) in these sciences, the degree of experimental control and the 
relative simplicity of the variables studied are such that the ambient 
noise level represented by unexplained and unexpected correlations, 
differences, and treatment effects is often vanishingly small.) 
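
The "about 1 in 4 for N = 100" figure can be checked with a rough calculation. The sketch below rests on assumptions not spelled out in the text: the true correlation is exactly .20 in absolute value, its sign is equally likely to agree or disagree with the prediction, significance is judged by a two-tailed test at the .05 level, and the sampling distribution of r is approximated by Fisher's z transformation. The function name is hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def chance_of_confirmation(rho_abs=0.20, n=100, alpha=0.05):
    """Probability of a significant correlation in the predicted direction
    when the theory is false and the true correlation is +/- rho_abs with
    equal probability (Fisher z approximation)."""
    se = 1 / sqrt(n - 3)                               # SE of Fisher's z
    z_crit = NormalDist().inv_cdf(1 - alpha / 2) * se  # cutoff on the z scale
    mu = atanh(rho_abs)                                # Fisher z of the true r
    p_if_sign_agrees = 1 - NormalDist(mu, se).cdf(z_crit)
    p_if_sign_disagrees = 1 - NormalDist(-mu, se).cdf(z_crit)
    return 0.5 * p_if_sign_agrees + 0.5 * p_if_sign_disagrees


print(chance_of_confirmation())        # N = 100
print(chance_of_confirmation(n=1000))  # approaches .5 as N grows
```

Under these assumptions the result for N = 100 comes out close to one in four, and it climbs toward one half as the sample grows, which is the sense in which a directional confirmation is cheap.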


The significance 
of large correlations 


It might be argued that, even where only a weak directional prediction 
is made, the obtaining of a result which is not only statistically 
significant but large in absolute value should constitute a stronger 
corroboration of the theory. For example, although Sapolsky pre- 
dicted only that frog responding and eating disorders would be 
positively related, the fourfold point correlation (phi coefficient) 
between these variables in his sample was about .46, surely much 
larger than the average relationship expected between random pairs 
of molar variables on the premise that “everything is related to every- 
thing else.” Does not such a large effect therefore provide stronger 
corroboration for the theory in question? 

One difficulty with this reasonable sounding doctrine is that, in 
the complex sort of research considered here, really large effects, 
differences, or relationships are not usually to be expected and, when 
found, may even argue against the theory being tested. To illustrate 
this, let us take Sapolsky’s theory seriously and, by making reasonable 
guesses concerning the unknown base rates involved, attempt to 
estimate the actual size of the relationship between frog responding 
and eating disorders which the theory should lead us to expect. 
Sapolsky found that 16% of his control sample showed eating dis- 
orders; let us take this value as the base rate for this symptom among 
patients who do not hold the cloacal theory of birth. Perhaps we can 
assume that all patients who do hold this theory will give frog responses 
but surely not all of these will show eating disorders (any more than 
will all patients who believe in vaginal conception be inclined to show 
coital or urinary disturbances); it seems a reasonable assumption that 
no more than 50% of the believers in oral conception will therefore 
manifest eating problems. Similarly, we can hardly suppose that the 
frog response always implies an unconscious belief in the cloacal 
theory; such a response can come to be emitted now and then for 
other reasons. Even with the greatest sympathy for Sapolsky’s point 
of view, we could hardly expect more than, say, 50% of frog responders 
to believe in oral impregnation. Therefore, we might reasonably 
predict that 16 of 100 nonresponders would show eating disorders in a 
test of this theory, 50 of 100 frog responders would hold the cloacal 
theory and half of these show eating disorders, while 16% or 8 of the 
remaining 50 frog responders will show eating problems too, giving a 
total of 33 eating disorders among the 100 frog responders. Such a 
finding would produce a significant chi-square but the actual degree of 
relationship as indexed by the phi coefficient would be only about .20. 

In other words, if one considers the supplementary assumptions 
which would be required to make a theory compatible with the actual 
results obtained, it becomes apparent that the finding of a really strong 
association may actually embarrass the theory rather than support it 
(e.g., Sapolsky’s finding of 61% eating disorders among his frog 
responders is significantly larger (p < .01) than the 33% generously 
estimated by the reasoning above). 
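
The figures in this section can be reproduced from the two fourfold tables. The sketch below uses the standard formula for the phi coefficient and the identity chi-square = N × phi² for a 2 × 2 table without continuity correction; the comparison of 19 of 31 against an expected rate of 33% is done here with an exact one-sided binomial tail, which is an assumption about the test, since the text does not say which one was used. The function and variable names are hypothetical.

```python
from math import comb, sqrt


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]]:
    rows = frog responders / nonresponders, columns = disorder yes / no."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def chi_square(a, b, c, d):
    """Uncorrected chi-square for a 2x2 table, via chi2 = N * phi**2."""
    n = a + b + c + d
    return n * phi_coefficient(a, b, c, d) ** 2


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


predicted = (33, 67, 16, 84)  # table expected if the theory is taken seriously
observed = (19, 12, 5, 26)    # Sapolsky's reported table

print(phi_coefficient(*predicted), chi_square(*predicted))
print(phi_coefficient(*observed), chi_square(*observed))
print(binomial_upper_tail(19, 31, 0.33))
```

The first line shows a chi-square above the conventional .01 cutoff of 6.63 for one degree of freedom alongside a phi of only about .20; the second recovers the phi of about .46 quoted earlier; the third is the tail probability behind the statement that 61% is significantly larger than 33%.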


Multiple corroboration 


In the social, clinical, and personality areas especially, we must expect 
that the size of the correlations, differences, or effects which might 
reasonably be predicted from our theories will typically not be very 
large relative to the ambient noise level of correlations and effects 
due solely to the “all-of-a-pieceness of things.” The conclusion seems 
inescapable that the only really satisfactory solution to the problem of 
corroborating such theories is that of multiple corroboration, the deriva- 
tion and testing of a number of separate, quasi-independent predictions. 
Since the prior probability of such a multiple corroboration may be 
on the order of (.5)ⁿ, where n is the number of independent² predic- 
tions experimentally confirmed, a theory of any useful degree of 
predictive richness should in principle allow for sufficient empirical 
confirmation through multiple corroboration to compel the respect 
of the most critical reader or editor. 
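
A small sketch shows how quickly (.5)ⁿ shrinks and what that would do to a skeptical reader's confidence. It goes beyond the text in two labeled ways: it applies Bayes' theorem explicitly, and it assumes that a true theory would yield every confirmation with certainty and that the predictions are fully independent. The prior of .01 echoes the median estimate reported earlier; the function names are hypothetical.

```python
def chance_all_confirmed(n, p_each=0.5):
    """Chance that n independent directional predictions all come out
    'right' when the theory behind them is false."""
    return p_each ** n


def posterior(prior, n, p_each=0.5):
    """Posterior probability of the theory after n confirmations, assuming
    a true theory always yields confirmation (an assumption of this sketch)."""
    by_chance = chance_all_confirmed(n, p_each)
    return prior / (prior + (1 - prior) * by_chance)


for n in (1, 3, 7, 10):
    print(n, chance_all_confirmed(n), round(posterior(0.01, n), 3))
```

On these assumptions a single confirmation moves a prior of .01 to about .02, whereas seven independent confirmations bring the chance explanation below .01 and the posterior above one half. Since real predictions from one theory are seldom strictly independent, these numbers are an upper bound on the corroboration actually earned.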


The relation of experimental 
findings to empirical facts 


We turn now to the examination of a second popular rule for the 
evaluation of psychological research, which states roughly that 
“When no obvious errors of sampling or experimental [...] are 
apparent, one’s confidence in the general proposition [...] 
(e.g., Variables A and B are positively correlated in Population C) 
should be proportional to the degree of statistical significance 
obtained.” We are following this rule when we [...] 
Sapolsky has at least demonstrated an empirical fact [...] 
frog responders have more eating disturbances than patients [...] 
This conclusion means, of course, that in the light of Sapolsky’s 
highly significant findings we should be willing to give very generous 
odds that any other competent investigator ([...] 
administering the Rorschach in his own way, and [...] the 
presence of eating problems in whatever manner seems reasonable 
and convenient for him) will also find a substantial positive relationship 
between these two variables. 


2. Tests of predictions from the same theory are seldom strictly independent 
since they often share some of the same supplementary assumptions, are 
made at the same time on the same sample, and so on. 


Let us be more specific. Given Sapolsky’s fourfold table showing 
19 of 31 frog responders to have eating disorders (61%), it can be 
shown by chi-square that we should have 99% confidence that the 
true population value lies between 13/31 and 25/31 (between 42% and 
81%). With 99% confidence that the population value is at least 13 
in 31, we should have .99(99) = 98% confidence that a new sample 
from that population should produce at least 6 eating disorders 
among each 31 frog responders, assuming that 5 of each 31 nonrespon- 
ders show eating problems also as Sapolsky reported. That is, we 
should be willing to bet $98 against only $2 that a replication of this 
experiment will show at least as many eating disorders among frog 
responders as among nonresponders. The reader may decide for 
himself whether his faith in the “empirical fact” demonstrated by this 
experiment can meet the test of this gambler’s challenge. 
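
The gambler's challenge can be worked through numerically. The sketch takes the lower confidence bound of 13 in 31 from the text as given and asks how likely a fresh sample of 31 frog responders is to contain at least 6 eating disorders if the population rate sits exactly at that bound; the use of an exact binomial tail is an assumption of this sketch, and the names are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


confidence_in_bound = 0.99  # confidence that the population rate is >= 13/31
lower_bound = 13 / 31

# Chance of at least 6 disorders among 31 new frog responders at that bound.
p_replicates = prob_at_least(6, 31, lower_bound)

print(p_replicates)
print(confidence_in_bound * p_replicates)
```

The product of the two probabilities lands at roughly 98%, the stake named in the text. Everything in the calculation presupposes that the new sample is drawn from the same population under the same conditions, which is the premise examined in the next section.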


Three kinds of replication 


If, as suggested above, “demonstrating an empirical fact” must involve a 
claim of confidence in the replicability of one’s findings, then to clearly 
understand the relation of statistical significance to the probability 
of a “successful” replication it will be helpful to distinguish between 
three rather different methods of replicating or cross-validating an 
experiment. Literal replication, of course, would involve exact duplica- 
tion of the first investigator’s sampling procedure, experimental condi- 
tions, measuring techniques, and methods of analysis; asking the 
original investigator to simply run more subjects would perhaps be 
about as close as we could come to attaining literal replication and even 
this, in psychological research, might often not be close enough. 
In the case of operational replication, on the other hand, one strives to 
duplicate exactly just the sampling and experimental procedures given 
in the first author’s report of his research. The purpose of operational 
replication is to test whether the investigator’s “experimental recipe” 
(the conditions and procedures he considered salient enough to be listed 
in the “Methods” section of his report) will in other hands produce 
the results that he obtained. For example, replication of the “Clever 
Hans” experiment revealed that the apparent ability of that remarkable 
horse to add numbers had been due to an uncontrolled and unsuspected 
factor (the presence of the horse’s trainer within his field of view). This 
factor, not being specified in the “methods recipe” for the result, was 
omitted in the replication which for that reason failed. Operational 
replication would be facilitated if investigators would accept more 
responsibility for specifying what they believe to be the minimum 
essential conditions and controls for producing their results. Psy- 
chologists tend to be inconsistently prolix in describing their experi- 
mental methods; thus, Sapolsky tabulates the age, sex, and diagnosis 
for each of his 62 subjects. Does he mean to imply that the experi- 
ment will not work if these details are changed?—surely not, but then 
why describe them? 

In the quite different process of constructive replication, one 
deliberately avoids imitation of the first author’s methods. To obtain an 
ideal constructive replication, one would provide a competent investi- 
gator with nothing more than a clear statement of the empirical 
“fact” which the first author would claim to have established—for 
example, “psychiatric patients who give frog responses on the Ror- 
schach have a greater tendency toward eating disorders than do patients 
in general”—and then let the replicator formulate his own methods of 
sampling, measurements, and data analysis. One must keep in mind 
that the data, the specific results of a particular experiment, are only 
seldom of any real interest in themselves. The “empirical facts” which 
we value so highly consist usually of confirmed conceptual or con- 
structive (not operational) hypotheses of the form “Construct A is 
positively related to Construct B in Population C.” We are interested 
in the construct “tendency toward eating disorders,” not in the datum 
“has reference made to overeating in the nurse’s notes for May 15th.” 
An operational replication tests whether we can duplicate our findings 
using the same methods of measurement and sampling; a constructive 
replication goes further in the sense of testing the validity of these 
methods. 

Thus, if I cannot confirm Sapolsky’s results for patients from my 
hospital, assessing eating disorders by means of informant interviews, 
say, or actual measurements of food intake, then clearly Sapolsky has 
not demonstrated any “fact” about eating disorders among psychiatric 
patients in general. I could then revert to an operational replication, 
assessing eating problems from the psychiatric notes as Sapolsky did 
and selecting my sample to conform with the age, sex, and diagnostic 
properties of his, although I might not regard this endeavor to be 
worth the effort since, under these circumstances, even a successful 
operational replication could not establish an empirical conclusion of 
any great generality or interest. Just as a reliable but invalid test can be 
said to measure something, but not what it claimed to measure, so 
an experiment which replicates operationally but not constructively 
could be said to have demonstrated something, but not the relation 
between meaningful constructs, generalizable to some broad reference 
population, which the author originally claimed to have established. 


Relation of the significance 
test to the probability of 
a “successful” replication 


[...] probability values resulting from significance testing can be 
[...] one’s confidence in expecting a [...] 
literal replication only. Thus, we can be 98% confident of finding at 
least 6 of 31 frog responders to have eating problems only if we repro- 
duce all of the conditions of Sapolsky’s experiment with [...] 
fidelity, something that he himself could not undertake to [...] 
point. Whether we are entitled to anything approaching such high 
[...] result from an [...] 
replication depends entirely upon whether Sapolsky has accurately 
[...] hospitals, [...] 
have the same correlates or meaning in [...] 
populations and therefore one would be reckless indeed to [...] 
odds on the outcome of even the most careful operational replication. 
The likelihood of a successful constructive replication is, of course, 
still smaller since it depends on the additional assumptions that Sapol- 
sky’s samples were truly representative of psychiatric patients in general 
and that his method of assessing eating problems was truly valid, 
that is, would correlate highly with a different, equally reasonable 
appearing method. 


3. This distinction between operational and constructive replication seems to 
have much in common with that made by Sidman (1960) between what he 
calls “direct” and “systematic” replication. However, in the operant research 
context to which Sidman directs his attention, “replication” means to run 
another animal or the same animal again; thus, direct replication involves 
maintaining the same experimental conditions in detail whereas in system- 
atic replication one allows all supposedly irrelevant factors to vary from 
one subject to the next in the hope of demonstrating that one has correctly 
identified the variables which are really in control of the behavior being 
studied. 


Another example 


It is not my purpose, of course, to criticize statistical theory or method 
but rather to suggest ways in which these tools are sometimes misused 
or misinterpreted by writers or readers of the psychological literature.
Nor do I mean to abuse a particular investigator whose research report
happened to serve as a convenient illustration of the components of
the argument. An abundance of articles can be found in the journals
which exemplify these points quite as well as Sapolsky’s but space
limitations forbid multiple examples. As a compromise, therefore, I
offer just one further illustration, showing how the application of these
same critical principles might have increased a reader’s (and perhaps
even an editor’s) skepticism concerning some research of my own.
The purpose of the experiment in question (Lykken, 1957) was to
test the hypothesis that the “primary” psychopath has reduced ability
to condition anxiety or fear. To segregate a subgroup in which such
primary psychopaths might be concentrated, I asked prison psychologists
to separate inmates already diagnosed as psychopathic personalities
into one group that met 14 rather specific clinical criteria
specified by Cleckley (1950, pp. 355-392) and to identify another group
which clearly did not fit some of these criteria. The normal control
subjects were comparable to the psychopathic groups in age, IQ, and
sex. Fear conditioning was assessed using the GSR as the dependent
variable and a rather painful electric shock as the unconditioned
stimulus (UCS). On the index used to measure rate of conditioning,
the primary psychopathic group scored significantly lower than did the
controls. By the usual reasoning, therefore, one might conclude from
this result that primary psychopaths are slow to condition the GSR,
at least with an aversive UCS, and this empirical fact in turn provides
significant support for the theory that primary psychopaths have
defective fear-learning ability (i.e., a low “anxiety IQ”).
But to anyone who has actually participated in research of this
kind, this seemingly straightforward reasoning
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how a sample obtained by a different investigator using equally
defensible methods might perform on the tests which I employed.
Even with the identical sample, no two investigators are likely to
measure the GSR in the same way, use the same conditioned stimulus
(CS) and UCS or the same pattern of reinforced and CS-only presentations.
Given even the same set of protocols, there is no standard formula for
obtaining an index of degree or rate of conditioning; the index I used
was essentially arbitrary and whether it was a good one is a matter of
opinion. My own evaluation of the methods used, together with a
complex set of supplementary assumptions difficult to explicate, leads
me to believe that these results increase the likelihood that primary 
psychopaths have slower GSR conditioning with an aversive UCS; 
I might now give odds of two to one that this empirical generalization 
is true and odds of three to two that another investigator would be 
able to confirm it by means of a constructive replication. But this 
already biased claim is far more modest than the one which is implicit 
in the significance testing operation, namely, “such a mean difference 


would only be expected 5 times in 100 if the [generalization] is not 
true.” 


This empirical generalization, about GSR conditioning, is derivable
from the hypothesis of interest, that psychopaths have a low anxiety
IQ, by a chain of reasoning so complex and elliptical and so burdened
with accessory assumptions as to be quite impossible to spell out in
the detail required for rigorous logical analysis. Psychologists
knowledgeable in this [...] but their opinions will not nece[...]
derivation could pass the scrutin[...] confirmation [...]
conditioning, by, for example, con[...] GSR conditioning is ret[...]
that these individuals h[...]
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that is, a denial that a low GSR index implied poor fear conditioning in 
their cases.

A redeeming feature of this study was that two other related but
distinguishable predictions from the same hypothesis were tested
at the same time, namely, that primary psychopaths should do as well
as normals on a learning task involving positive reward but less well
on an avoidance learning problem, and that they should be more
willing than normals to choose embarrassing or frightening situations
in preference to alternatives involving tedium, frustration, physical
discomfort, and the like. Tests of these predictions gave affirmative
results also, thus providing some of the multiple corroboration
necessary for the hypothesis to claim the attention of other experimenters.

Obviously, I do not mean to criticize the editor’s decision to publish 
my (1957) paper. The tendency to evaluate research in terms of 
mechanical rules based on the results of the significance tests should 
not be replaced by equally rigid requirements concerning replication 
or corroboration. This study, like Sapolsky’s or most others in this
field, can be properly evaluated only by a qualified reader who can
substitute his own informed judgment and scientific intuition for the
rigorous reasoning and experimental control that is usually not achievable
in clinical and personality research. As it happens, subsequent
work has provided some encouraging support for my 1957 findings.
The two additional predictions mentioned above have received
operational replication (i.e., the same test methods used in a different
context) by Schachter and Latané (1964). The prediction that psychopaths
show slower GSR conditioning with an aversive UCS has been
constructively replicated (i.e., independently tested with no attempt to
copy my procedures) by Hare (1965a). Finally, two additional predictions
from the theory that the primary psychopath has a low anxiety IQ
have been tested with affirmative results (Hare, 1965b; [...]). All
told, then, this hypothesis can now boast of having led to
five quasi-independent predictions which have been experimentally
confirmed and three of which have been replicated. The hypothesis is
therefore entitled to serious consideration although one would be
rash still to regard it as proven. At least one alternative hypothesis, that
the psychopath has an unusually efficient mechanism for inhibiting
emotional arousal, can account equally well for the existing findings
so that, as is usually the case, further research is called for.


Conclusions

The moral of this story is that the finding of statistical significance is
perhaps the least important attribute of a good experiment; it is
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never a sufficient condition for concluding that a theory has been 
corroborated, that a useful empirical fact has been established with 
reasonable confidence—or that an experimental report ought to be 
published. The value of any research can be determined, not from the 
statistical results, but only by skilled, subjective evaluation of the 
coherence and reasonableness of the theory, the degree of experimental 
control employed, the sophistication of the measuring techniques, the 
scientific or practical importance of the phenomena studied, and so on. 
Ideally, all experiments would be replicated before publication but this 
goal is impractical. “Good” experiments will tend to replicate better 
than poor ones (and, when they do not, the failures will tend to be 
informative in themselves, which is not true for poor experiments) 
and should be published so that they may stimulate replication and 
extension by others. Editors must be bold enough to take responsi- 
bility for deciding which studies are good and which are not, without 
resorting to letting the p value of the significance tests determine 
this decision. There is little real danger that anything of value will be 
lost through this approach since the unpublished investigator can 
always resort to constructive replication to induce editorial acceptance 
of his empirical conclusions or to multiple corroboration to compel 
editorial respect for his theory. Since operational replication must
really be done by an independent second investigator and since
constructive replication has greater generality, its success strongly
implying that an operational replication would have succeeded also,
one should replicate one’s own work constructively, using
different sampling and measurement procedures within the purview
of the same constructive hypothesis. If only unusually well done,
provocative, and important research were published without prior
authentication, operational replication of such research by others
would become correspondingly more valuable and entitled to the
respect now accorded capable replication in the other experimental
sciences.
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theory-testing in 
psychology and physics: 


a methodological paradox 


Paul E. Meehl¹


The purpose of the present paper is not so much to propound a doctrine
or defend a thesis (especially as I should be surprised if either psychologists
or statisticians were to disagree with whatever in the nature of a
[...] design, instrumentation, or numerical mass of data, is to increase
the difficulty of the “observational hurdle” which the physical theory
of interest must successfully surmount; whereas, in psychology and
some of the allied behavior sciences, the usual effect
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1. 


I wish to express my indebtedness to Dr. David T. Lykken, conversations
[...] stimulating my thinking along these
lines, and whose views and examples have no doubt influenced the form of
the argument in this paper. For an application of these and allied
considerations to a specific example of poor research in
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of such improvement in experimental precision is to provide an easier 
hurdle for the theory to surmount. Hence what we would normally 
think of as improvements in our experimental method tend (when 
predictions materialize) to yield stronger corroboration of the theory 
in physics, since to remain unrefuted the theory must have survived a 
more difficult test; by contrast, such experimental improvement in 
psychology typically results in a weaker corroboration of the theory,
since it has now been required to survive a more lenient test [3] [9] [10].
Although the point I wish to make is one in logic and methodology
of science and, as I think, does not presuppose adoption of any of the
current controversial viewpoints in technical statistics, a brief exposition
of the process of statistical inference as we usually find it in the social
sciences is necessary. (The philosopher who is unfamiliar with this
subject-matter may be referred to any good standard text on statistics,
such as the widely used book by Hays [5] which includes a clear and
succinct treatment of the main points I shall briefly summarize here.)
On the basis of a substantive psychological theory T in which he
is interested, a psychologist derives (often in a rather loose sense of
‘derive’) the consequence that an observable variable x will differ as
between two groups of subjects. Sometimes, as in most problems of
clinical or social psychology, the two groups are defined by a property
the individuals under study already possess, e.g. social class, sex,
diagnosis, or measured I.Q. Sometimes, as is more likely to be the case
in such fields as psychopharmacology or psychology of learning, the
contrasted groups are defined by the fact that the experimenter has
subjected them to different experimental influences, such as a drug,
a reward, or a specific kind of social pressure. Whether the contrasted
groups are specified by an “experiment of nature” where the investigator
takes the organisms as he finds them, or by a true “experiment” in the
more usual sense of the word, is not crucial for the present argument;
although, as will be seen, the implications of my puzzle for theory-testing
are probably more perilous in the former kind of research than
in the latter.

According to the substantive theory T, the two groups are expected
to differ on variable x, but it is recognized that errors of (a) measurement
and (b) random sampling will, in general, produce some observed
difference between the averages of the groups studied, even if their total
populations did not differ in the true value of $\mu_x$ [= mean of x].


Example: We are interested in the question whether girls are brighter
than boys (i.e., that $\mu_g - \mu_b = \delta_{gb} > 0$). We do not have perfectly
reliable measures of intelligence, and we are not in a
position to measure the intelligence of all boys and girls in the hypothetical
population about which we desire to make a general statement.
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Instead we must be content with fallible I.Q. scores, and with a sample
of school children drawn from the hypothetical population. Each
of these sources of error, measurement error and random sampling
error, contributes to an untrustworthiness in the value we
obtain for the average intelligence $\bar{X}_b$ of the boys and also for $\bar{X}_g$, that
of the girls. If we observe a difference of, say, d = 5 I.Q. points in a
sample of 100 boys and 100 girls, we must have some method of inferring
whether this obtained observational difference between the two groups
reflects a real difference or one which is merely apparent, i.e., due to the
combined effect of errors of measurement and sampling. We do this
by means of a “statistical significance test,” the mathematics of which
is not relevant here, except to say that by combining the principles of
probability with a characterization of the procedure by which the
samples were constituted, and quantifying the variation in observed
intelligence score within each of the two groups being contrasted,
it is possible to employ a formula which utilizes the observed averages
together with the observed variations and sample sizes so as to answer
certain relevant kinds of questions. Among such questions is the
following: “If there were, in fact, no real difference in average I.Q.
between the population of boys and girls, with what relative frequency
would an investigator find a difference (in relation to the observed
intragroup variation) of the magnitude our observations have actually
found?”
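
As a rough numerical illustration of the question just posed, the sketch below evaluates the I.Q. example with a normal approximation to the t-test. The within-group standard deviation of 15 points is an assumption introduced here (the text gives only d = 5 and 100 children per group), and the function and variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p(diff, sd, n_per_group):
    """Relative frequency of a sample difference >= diff when the true difference is 0.

    Uses a normal approximation to the two-sample t-test and assumes
    both groups share the same within-group standard deviation `sd`.
    """
    se = sd * sqrt(2.0 / n_per_group)  # standard error of the difference of two means
    z = diff / se                      # observed difference in standard-error units
    return 1.0 - NormalDist().cdf(z), z


# d = 5 and n = 100 per group come from the example; sd = 15 is assumed.
p_value, z_score = one_sided_p(diff=5.0, sd=15.0, n_per_group=100)
print(f"z = {z_score:.2f}, one-sided p = {p_value:.4f}")
```

Under these assumed figures the girl-over-boy difference lies a little more than two standard errors from zero, so it would arise by sampling error alone somewhat less than once in a hundred samples.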


The statistical hypothesis, that there is no population difference
between boys and girls in I.Q., which is called the “null hypothesis”
[$H_0: \delta = 0$], is used to generate a random sampling distribution of
the statistic (“t-test”) employed in testing the presence of a significant
difference. If the observed data would be very improbable on the
hypothesis that $H_0$ obtained, we abandon $H_0$ in favor of its alternative.
We conclude that since $H_0$ is false, its alternative, i.e., that there exists a
real average difference between the sexes, obtains. [...] it was
[...] populations. In recent [...]
that what is of theoretical interest is n[...]
“directional null hypothesis,” [...]
hypothesis $H_0$. If our substantive theory [...]
that the average I.Q. of girls in the entire population exceeds that of
boys, we test the alternative to this statistical hypothesis about the
population, i.e., that either the average I.Q. of boys exceeds that of girls
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($H_2$) or that there is no difference ($H_0$). That is, we adopt for statistical
test (with the anticipation of refuting it) a disjunction of the old-fashioned
point-null hypothesis $H_0$ with the hypothesis $H_2$ that $H_0$ is
false and it is false in a direction opposite to that implied by our substantive
theory. However, this directional null hypothesis ($H_{02}: \mu_g \le \mu_b$),
unlike the old-fashioned point-null hypothesis ($H_0: \mu_g = \mu_b$), does
not generate a theoretically expected distribution, because it is not
precise, i.e., it does not specify a point-value for the unknown parameter
($\delta = \mu_{girls} - \mu_{boys}$). However, we can employ it as we do the point-null
hypothesis, by reasoning that if the point-null hypothesis $H_0$ obtained
in the state of nature, then an observed difference (in the direction that
our substantive theory predicts) of such-and-such magnitude has a
calculable probability; and that calculable probability is an upper
bound upon the desired (but unknown) probability based on
$H_{02}: \mu_g \le \mu_b$. That is to say, if the probability of observed
girl-over-boy difference ($d_{gb} = \bar{X}_g - \bar{X}_b$) arising through random error is p,
given the point-null hypothesis $H_0: \mu_g = \mu_b$, then the probability
of the observed difference arising randomly given any of the
point-hypotheses constituting $H_2: \mu_g < \mu_b$ will of course be less than p.
Hence p is an upper bound on this probability for the inexact directional
null hypothesis ($H_{02}: \mu_g \le \mu_b$). Proceeding in this way directs
our interest to only one tail of the theoretical random sampling
distribution instead of both tails, which has given rise to a certain
amount of controversy among statisticians, but that controversy is not
relevant here. (For an excellent clarifying discussion, see Kaiser [6].)
Suffice it to say that having formulated a directional null hypothesis $H_{02}$
which is the alternative to the statistical hypothesis of interest $H_1$, and
which includes the point-null hypothesis $H_0$ as one (very unlikely)
possibility for the state of nature, we then carry out the experiment with
the anticipation of refuting this directional null hypothesis, thereby
confirming the alternative statistical hypothesis of interest ($H_1$),
and, since $H_1$ in turn was implied by the substantive theory T, of
corroborating T.
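
The upper-bound argument can be checked numerically. The following is only a sketch under assumptions introduced here: a normal sampling distribution for the difference between means, a within-group standard deviation of 15, and 100 children per group. It computes the probability of an observed girl-over-boy difference of at least 5 points for several true differences that fall under the directional null hypothesis.

```python
from math import sqrt
from statistics import NormalDist


def prob_at_least(d_obs, true_delta, se):
    """P(sample difference >= d_obs) when the true difference is true_delta."""
    return 1.0 - NormalDist(mu=true_delta, sigma=se).cdf(d_obs)


SE = 15.0 * sqrt(2.0 / 100)  # assumed sd = 15, n = 100 per group
D_OBS = 5.0                  # observed girl-over-boy difference

# Probability under the point-null hypothesis (true difference exactly zero).
p_point_null = prob_at_least(D_OBS, 0.0, SE)

# Probabilities under point-hypotheses with boys ahead (true difference < 0).
p_boys_ahead = {delta: prob_at_least(D_OBS, delta, SE) for delta in (-0.5, -1.0, -2.0, -5.0)}

print(f"true difference  0.0: p = {p_point_null:.5f}")
for delta, p in p_boys_ahead.items():
    print(f"true difference {delta:4.1f}: p = {p:.5f}")
```

Every probability computed for a negative true difference is smaller than the one computed for a true difference of zero, which is why the point-null value of p can serve as a ceiling for the whole directional null hypothesis.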
In such a situation we know in advance that we are in danger of
making either of two sorts of “errors,” not in the sense of committing
scientific mistakes but in the sense of (rationally) inferring what is
objectively a false conclusion. If the null hypothesis (point or directional)
is in fact true, but due to the combination of measurement and
sampling errors we obtain a value which is so improbable upon $H_0$ or
$H_{02}$ that we decide in favor of their alternative $H_1$, we will then have
committed what is known as an error of the first kind or Type I Error.
An error of the first kind is a statistical inference that the null hypothesis
is false, when in the state of nature it is actually true. This means we will
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have concluded in favor of a statistical statement $H_1$ which flowed as a
consequence of our substantive theory T, and therefore we will believe
ourselves to have obtained empirical support for T, whereas in reality
this statistical conclusion is false and, consequently, such support for
the substantive theory is objectively lacking. Measurement and
sampling error may, of course, also result in a sampling deviation in
the opposite direction; or, the true difference $\delta$ may be so small that
even if our sample values were to coincide exactly with the true ones,
the sheer algebra of the significance test would not enable us to reach
the prespecified level of statistical significance. If we conclude until
further notice that the directional null hypothesis $H_{02}$ is tenable, on
the grounds that we have failed to refute it by our investigation, then
we have failed to support its statistical alternative $H_1$, and therefore
failed to confirm one of the predictions of the substantive theory T.
Retention of the null hypothesis $H_{02}$ when it is in fact false is known as
an error of the second kind or Type II Error.

In the biological and social sciences there has been widespread
adoption of the probabilities .01 or .05 as the allowable theoretical
frequency of Type I errors. These values are called the 1% and 5%
“levels of significance.” It is obvious that there is an inverse relationship
between the probabilities of the two kinds of errors, so that if we adopt
a significance level which increases the frequency of Type I errors, such
a policy will lead to a greater number of claims of statistically significant
departure from the null hypothesis; and, therefore, in whatever
unknown fraction of all experiments performed the null hypothesis
is in reality false, we will more often (correctly) conclude its falsity, i.e.,
we will thereby be reducing the proportion of Type II errors.
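
The inverse relationship between the two error rates can be made concrete for a single fixed design. The sketch below is illustrative only. It assumes a one-sided test with a normal approximation, and the true difference of 3 points, the standard deviation of 15, and the 100 subjects per group are all figures chosen here rather than taken from the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type2_rate(alpha, true_diff, sd, n_per_group):
    """Probability of retaining the null hypothesis when the true difference is true_diff."""
    se = sd * sqrt(2.0 / n_per_group)         # standard error of the difference
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)  # one-sided critical value
    # The test statistic is centered on true_diff / se; beta is the mass below z_crit.
    return STD_NORMAL.cdf(z_crit - true_diff / se)


# Same hypothetical experiment evaluated at the two conventional levels.
beta_at_05 = type2_rate(alpha=0.05, true_diff=3.0, sd=15.0, n_per_group=100)
beta_at_01 = type2_rate(alpha=0.01, true_diff=3.0, sd=15.0, n_per_group=100)

print(f"alpha = .05 -> Type II rate = {beta_at_05:.3f}")
print(f"alpha = .01 -> Type II rate = {beta_at_01:.3f}")
```

With everything else held constant, the more permissive 5% level yields the lower Type II rate, and the stricter 1% level purchases its protection against Type I errors at the price of more Type II errors.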

Suppose we hold fixed the theoretically calculable incidence of
Type I errors. Thus we determine that if the null hypothesis is in fact
true in the state of nature, we do not wish to risk erroneously concluding
that it is false more than, say, five times in 100. Holding this
5% significance level fixed (which, as a form of scientific strategy,
means leaning over backward not to conclude that a relationship exists
when there isn’t one, or when there is a relationship in the wrong
direction), we can decrease the probability of Type II errors by
improving our experiment in certain respects. There are three general
ways in which the frequency of Type II errors can be decreased (for
fixed Type I error-rate), namely, (a) by improving the logical structure
of the experiment, (b) by improving experimental techniques such as the
control of extraneous variables which contribute to intragroup variation
(and hence appear in the denominator of the significance
test), and (c) by increasing the size of the sample. Given a specified true
difference in the [...], the complement of the probability
of a Type II error is known as the power, and an improvement in the
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experiment by any or all of these three methods yields an increase in
power (or, to use words employed by R. A. Fisher, the experiment’s
“sensitiveness” or “precision”). For many years relatively little
emphasis was put upon the problem of power, but recently this concept
has come in for a good deal of attention. Accordingly, up-to-date
psychological investigators are normally expected to include some
preliminary calculations regarding power in designing their experiments.
We select a logical design and choose a sample size such that
it can be said in advance that if one is interested in a true difference
provided it is at least of a specified magnitude (i.e., if it is smaller than
this we are content to miss the opportunity of finding it), the probability
is high (say, 80%) that we will successfully refute the null hypothesis.
See, for example, Cohen’s literature sampling [4] on the problem of
power. For an incisive critique of the whole approach, a critique
which has been given far less respectful attention than it deserves
(conspiracy of silence?), I recommend Rozeboom’s excellent contribution
[11]. But I should emphasize that my argument in this paper does
not hinge upon the reader’s agreement with Rozeboom’s very strong
attack (although I, myself, incline to go along with him).
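
A preliminary power calculation of the kind described might look like the sketch below. It covers only routes (b) and (c), reduced intragroup variation and larger samples, since improvements in logical design (a) have no simple formula. The one-sided normal approximation, the function names, and the figures (a smallest difference of interest of 3 points, a standard deviation of 15) are assumptions made here for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(true_diff, sd, n_per_group, alpha=0.05):
    """Probability of refuting the null hypothesis when the true difference is true_diff."""
    se = sd * sqrt(2.0 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_crit - true_diff / se)


def n_for_power(true_diff, sd, target=0.80, alpha=0.05):
    """Smallest per-group sample size reaching the target power (normal approximation)."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_power = STD_NORMAL.inv_cdf(target)
    return ceil(2.0 * ((z_alpha + z_power) * sd / true_diff) ** 2)


baseline = power(true_diff=3.0, sd=15.0, n_per_group=100)
less_noise = power(true_diff=3.0, sd=10.0, n_per_group=100)     # route (b): smaller error term
bigger_sample = power(true_diff=3.0, sd=15.0, n_per_group=400)  # route (c): larger sample
n_needed = n_for_power(true_diff=3.0, sd=15.0)

print(f"baseline power      = {baseline:.3f}")
print(f"with sd cut to 10   = {less_noise:.3f}")
print(f"with n raised to 400 = {bigger_sample:.3f}")
print(f"n per group for 80% power = {n_needed}")
```

Both refinements raise the power at the fixed 5% level, which is the sense in which a better experiment is a more sensitive one.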

It is important to keep clear the distinction between the substantive 
theory of interest and the statistical hypothesis which is derived from 
it [2]. In the I.Q. example there was almost no substantive theory or 
a very impoverished one; i.e., the question being investigated was
itself stated as a purely statistical question about the average I.Q. of
the two sexes. In the great majority of investigations in psychology
the situation is otherwise. Normally, the investigator holds some
substantive theory about unconscious mental processes, or physiological
or genetic entities, or perceptual structure, or about learning
influences in the person’s past, or about current social pressures, which
contains a great deal more content than the mere statement that the
population parameter of an observational variable is greater for one
group of individuals than for another. While no competent psychologist
is unaware of this obvious distinction between a substantive psychological
theory T and a statistical hypothesis H implied by it, in
practice there is a tendency to conflate the substantive theory with the
statistical hypothesis, thereby illicitly conferring upon T somewhat
the same degree of support given H by a successful refutation of the
null hypothesis. Hence the investigator, upon finding an observed
difference which has an extremely small probability of occurring on the
null hypothesis, gleefully records the tiny probability number
“p < .001,” and there is a tendency to feel that the extreme smallness
of this probability of a Type I error is somehow transferrable to a small
probability of “making a theoretical mistake.” It is as if, when the
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observed statistical result would be expected to arise only once in a
thousand times through a Type I statistical error given $H_0$, therefore
one’s substantive theory T, which entails the alternative $H_1$, has
received some sort of direct quantitative support of magnitude around
.999 [= 1 − .001].

To believe this literally would, of course, be an undergraduate 
mistake of which no competent psychologist would be guilty; I only 
want to point to the fact that there is a subtle tendency to “carry over”
a very small probability of a Type I error into a sizeable resulting 
confidence in the truth of the substantive theory, even among inves- 
tigators who would never make an explicit identification of the one 
probability number with the complement of the other. 

One reason why the directional null hypothesis ($H_{02}: \mu_g \le \mu_b$)
is the appropriate candidate for experimental refutation is the universal
agreement that the old point-null hypothesis ($H_0: \mu_g = \mu_b$) is [quasi-]
always false in biological and social science. Any dependent variable
of interest, such as I.Q., or academic achievement, or perceptual speed,
or emotional reactivity as measured by skin resistance, or whatever,
depends mainly upon a finite number of “strong” variables characteristic
of the organisms studied (embodying the accumulated results of
their genetic makeup and their learning histories) plus the influences
manipulated by the experimenter. Upon some complicated, unknown
mathematical function of this finite list of “important” determiners is
then superimposed an indefinitely large number of essentially “random”
factors which contribute to the intragroup variation and therefore
boost the error term of the statistical significance test. In order for two
groups which differ in some identified properties (such as social class,
intelligence, diagnosis, racial or religious background) to differ not
at all in the output variable of interest, it would be necessary that all
determiners of the output variable have precisely the same average
values in both groups, or else that their values should differ by a pattern
of amounts of difference which precisely counterbalance one another to
yield a net difference of zero. Now our general background knowledge
in the social sciences, or, for that matter, even “common sense”
considerations, makes such an exact equality of all determining variables,
or a precise “accidental” counterbalancing of them, so extremely
unlikely that no psychologist or statistician would assign more than a
negligibly small probability to such a state of affairs.


Example: Suppose we are studying a simple perceptual-verbal task
like rate of color-naming in school children, and the independent
variable is father’s religious preference. Superficial consideration might
suggest that these two variables would not be related, but a little 
thought leads one to conclude that they will almost certainly be related 
by some amount, however small. Consider, for instance, that a child’s 
reaction to any sort of school-context task will be to some extent 
dependent upon his social class, since the desire to please academic 
personnel and the desire to achieve at a performance (just because it 
is a task, regardless of its intrinsic interest) are both related to the
kinds of sub-cultural and personality traits in the parents that lead to 
upward mobility, economic success, the gaining of further education, 
and the like. Again, since there is known to be a sex difference in 
color-naming, it is likely that fathers who have entered occupations 
more attractive to “feminine” males will (on the average) provide a
somewhat more feminine father-figure for identification on the part of 
their male offspring, and that a more refined color vocabulary, making 
closer discriminations between similar hues, will be characteristic of 
the ordinary language of such a household. Further, it is known that 
there is a correlation between a child’s general intelligence and its 
father’s occupation, and of course there will be some relation, even 
though it may be small, between a child’s general intelligence and his 
color vocabulary, arising from the fact that vocabulary in general is 
heavily saturated with the general intelligence factor. Since religious 
preference is a correlate of social class, all of these social class factors,
as well as the intelligence variable, would tend to influence color- 
naming performance. Or consider a more extreme and faint kind of 
relationship. It is quite conceivable that a child who belongs to a more 
liturgical religious denomination would be somewhat more color- 
oriented than a child for whom bright colors were not associated with 
the religious life. Everyone familiar with psychological research knows 
that numerous “puzzling, unexpected” correlations pop up all the time, 
and that it requires only a moderate amount of motivation-plus- 
ingenuity to construct very plausible alternative theoretical explana- 
tions for them. 

These armchair considerations are borne out by the finding that in
psychological and sociological investigations involving very large
numbers of subjects, it is regularly found that almost all correlations or
differences between means are statistically significant. See, for example,
the papers by Bakan [1] and Nunnally [8]. Data currently being
analyzed by Dr. David Lykken and myself, derived from a huge
sample of over 55,000 Minnesota high school seniors, reveal statistically
significant relationships in 91% of pairwise associations among a
congeries of 45 miscellaneous variables such as sex, birth order,
religious preference, number of siblings, vocational choice, club
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membership, college choice, mother’s education, dancing, interest in 
woodworking, liking for school, and the like. The 9% of non-signifi- 
cant associations are heavily concentrated among a small minority of 
variables having dubious reliability, or involving arbitrary groupings 
of non-homogeneous or non-monotonic sub-categories. The majority 
of variables exhibited significant relationships with all but three of the 
others, often at a very high confidence level (p < 10⁻⁶).
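
The part played by sheer sample size in such a finding can be made concrete with a short sketch. It assumes the Fisher z approximation for a correlation coefficient, which is an illustrative choice and not the analysis applied to the Minnesota data; the function name `critical_r` is hypothetical. It returns the smallest absolute correlation that reaches two-sided significance for a given number of subjects.

```python
from math import sqrt, tanh
from statistics import NormalDist


def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| that is significant (two-sided) with n subjects.

    Assumes the Fisher z approximation: when the true correlation is
    zero, atanh(r) is roughly normal with standard error 1/sqrt(n - 3).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))


if __name__ == "__main__":
    for n in (100, 1_000, 55_000):
        print(n, round(critical_r(n), 4))
```

With tens of thousands of subjects the threshold falls below .01, so even a very faint association between two variables will be declared significant.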

This line of reasoning is perhaps not quite as convincing in the 
case of true experiments, where the subjects are randomly assigned by 
the investigator to different experimental manipulations. If the reader 
is disinclined to follow me here, my overall argument will, for him, be 
applicable to those kinds of research in social science which study the 
correlational relationships or group differences between subjects
“as they come,” but not to the type of investigation which constitutes
an experiment in the usual scientific sense. However, I myself believe
that even in the strict sense of “experiment,” the argument is still strong,
although the quantitative departures from the point-null H0 would be
expected to run considerably lower on the average. Considering the
fact that “everything in the brain is connected with everything else,”
and that there exist several “general state-variables” (such as arousal,
attention, anxiety, and the like) which are known to be at least slightly
influenceable by practically any kind of stimulus input, it is highly
unlikely that any psychologically discriminable stimulation which we
apply to an experimental subject would exert literally zero effect upon
any aspect of his performance. The psychological literature abounds
with examples of small but detectable influences of this kind. Thus it is
known that if a subject memorizes a list of nonsense syllables in the
presence of a faint odor of peppermint, his recall will be facilitated
by the presence of that odor. Or, again, we know that individuals solving
intellectual problems in a “messy” room do not perform quite as well
as individuals working in a neat, well-ordered surround. Again,
cognitive processes undergo a detectable facilitation when the thinking
subject is concurrently performing the irrelevant, noncognitive task of
squeezing a hand dynamometer. It would require considerable
ingenuity to concoct experimental manipulations, except the most
minimal and trivial (such as a very slight modification in the word
order of instructions given a subject) where one could have confidence
that the manipulation would be utterly without effect upon the subject’s
motivational level, attention, arousal, fear of failure, achievement
drive, desire to please the experimenter, distraction, social fear,
etc., etc. So that, for example, while there is no very “interesting”
psychological theory that links hunger drive with color-naming
ability, I myself would confidently predict a significant difference in
color-naming ability between persons tested after a full meal and
persons who had not eaten for 10 hours, provided the sample size 
were sufficiently large and the color-naming measurements sufficiently 
reliable, since one of the effects of the increased hunger drive is 
heightened “arousal,” and anything which heightens arousal would be 
expected to affect a perceptual-cognitive performance like color- 
naming. Suffice it to say that there are very good reasons for expecting 
at least some slight influence of almost any experimental manipulation 
which would differ sufficiently in its form and content from the 
manipulation imposed upon a control group to be included in an 
experiment in the first place. In what follows I shall therefore assume 
that the point-null hypothesis H0 is, in psychology, [quasi-] always
false.

Let us now conceive of a large “theoretical urn” containing counters
designating the indefinitely large class of actual and possible
substantive theories concerning a certain domain of psychology (e.g.,
mammalian instrumental learning). Let us conceive of a second urn, the
“experimental-design” urn, containing counters designating the
indefinitely large set of possible experimental situations which the
ingenuity of man could devise. (If anyone should object to my
conceptualizing, for purposes of methodological analysis, such a
heterogeneous class of theories or experiments, I need only remind him
that such a class is universally presupposed in the logic of statistical
significance testing.) Since the point-null hypothesis H0 is [quasi-]
always false, almost every one of these experimental situations involves
a non-zero difference on its output variable (parameter). Whichever
group we (arbitrarily) designate as the “experimental” group and which
as the “control” group, in half of these experimental settings the true
value of the dependent variable difference (experimental minus control)
will be positive, and in the other half negative.
It may be objected that this is a use of the Principle of Insufficient
Reason and presupposes one particular answer to some disputed
questions in statistical theory (as between the Bayesians and the
Fisherians). But I must emphasize that I have said nothing about the
form or range or other parametric characteristics of the distribution of
true differences. I have merely said that the point-null hypothesis
is always false, and I have then assigned, in a strictly random fashion,
the names “experimental” and “control” to the two groups which a given
experimental setup treats in two different ways. That is, it makes
no difference here whether a group of subjects learning nonsense
syllables while squeezing a hand dynamometer is called the experimental
group, or whether we call “experimental” the group that learns
the nonsense syllables without such squeezing. Hence my use of the
Principle of Insufficient Reason is one of those legitimate,
non-controversial uses following directly when the basic principles of
probability are applied to a specification of procedure for random
assignment.

We now perform a random pairing of the counters from the 
“theory” urn with the counters from the “experimental” urn, and 
arbitrarily stipulate—quite irrationally—that a “successful” outcome 
of the experiment means that the difference favors the experimental 
group [μ_e − μ_c > 0]. This preposterous model, which is of course
much worse than anything that can exist even in the most primitive of
the social sciences, provides us with a lower bound for the expected
frequency of a theory’s successfully predicting the direction in which the
null hypothesis fails, in the state of nature (i.e., we are here not
considering sampling problems, and therefore we neglect errors of
either the first or the second kind). It is obvious that if the point-null
hypothesis H0 is [quasi-] always false, and there is no logical connection
between our theories and the direction of the experimental outcomes,
then if we arbitrarily assign one of the two directional hypotheses H1
or H2 to each theory, that hypothesis will be correct half of the time,
i.e., in half of the arbitrary urn-counter-pairings. Since even my late,
uneducated grandmother’s common-sense psychological theories had
nonzero verisimilitude, we can safely say that the value p = 1/2 is a
lower bound on the success-frequency of experimental “tests,” assuming 
our experimental design had perfect power. 

Countervailing the unknown increment over p = 1/2 which arises
from the fact that the experimental and theoretical counters are not
thus drawn randomly (since our theories do possess, on the average, at
least some tiny amount of verisimilitude), there is the statistical factor
that among the counter-pairings which are accidentally “successful”
(in the sense that the state of nature falsifies the null hypothesis in the
expected direction), we will sometimes fail to refute it because of
measurement and sampling errors, since our experiments will always,
in practice, have less than perfect power. Even though the point-null
hypothesis H0 is always false, so that the directional null hypothesis
H02 is false in the (theoretically pseudo-predicted) direction half the
time, we will sometimes fail to discover this because of Type II errors.
Without making illegitimate prior-probability assumptions concerning
the actual distribution of true differences in the whole vast world of
psychological experimental contexts, one cannot say anything definite
about the extent to which this countervailing influence of Type II
errors will wash out (or even overcome) the fact that our theories tend
to have non-negligible verisimilitude. But by setting aside this
fact, i.e., by assuming counterfactually that there is no connection
between our theories and our experimental designs (the two-urn
idealization), thereby fixing the expected frequency of successful
refutations of the directional null hypothesis H02 at p = 1/2 for
experiments of perfect power; it follows that as the power of our
experimental designs and significance tests is increased by any of the
three methods described above, we approach p = 1/2 as the limit of our
expected frequency of “successful outcomes,” i.e., of attaining
statistically significant experimental results in the theoretically
predicted direction.
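
The limiting value can be illustrated with a small Monte Carlo sketch of the two-urn idealization. Several features are assumptions made only for illustration: true differences are drawn from a normal distribution (the argument itself deliberately assumes no such distribution), the test is a one-sided z test with known unit variance, and the names `success_rate` and `effect_sd` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def success_rate(n_per_group, n_experiments=4000, effect_sd=0.2,
                 alpha=0.05, seed=0):
    """Share of experiments significant in the 'predicted' direction
    when the prediction is unrelated to the truth (two-urn model)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value
    se = sqrt(2.0 / n_per_group)               # SE of a mean difference, unit variance
    hits = 0
    for _ in range(n_experiments):
        true_diff = rng.gauss(0.0, effect_sd)  # point-null is never exactly true
        predicted_sign = rng.choice((-1, 1))   # theory drawn independently of nature
        observed = rng.gauss(true_diff, se)    # observed mean difference
        if predicted_sign * observed / se > z_crit:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, success_rate(n))
```

Under these assumptions the rate of “successful” outcomes climbs toward one half as the sample size grows, and does not exceed it except by simulation noise.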

I conclude that the effect of increased precision, whether achieved 
by improved instrumentation and control, greater sensitivity in the 
logical structure of the experiment, or increasing the number of 
observations, is to yield a probability approaching 1/2 of corroborating
our substantive theory by a significance test, even if the theory is
totally without merit. That is to say, the ordinary result of improving
our experimental methods and increasing our sample size, proceeding
in accordance with the traditionally accepted method of theory-testing
by refuting a directional null hypothesis, yields a prior probability
p ≈ 1/2, and very likely somewhat above that value by an unknown
amount. It goes without saying that successfully negotiating an
experimental hurdle of this sort can constitute only an extremely
weak corroboration of any substantive theory, quite apart from
currently disputed issues of the Bayesian type regarding the assignment
of prior probabilities to the theory itself.

So far as I am able to discern, this methodological truth is either
unknown or systematically ignored by most behavior scientists. I
do not know to what extent this is attributable to confusion between
the substantive theory T and the statistical hypothesis H1, with the
resulting mis-assignment of the probability (1 − p1) complementary
to the significance level p1 attained, to the “probability” of the
substantive theory; or to what extent it arises from insufficient
attention to the truism that the point-null hypothesis H0 is [quasi-]
always false. It seems unlikely that most social science investigators
would think in their usual way about a theory in meteorology which
“successfully predicted” that it would rain on the 17th of April, given
the antecedent information that it rains (on the average) during half
the days in the month of April!

But this is not the worst of the story. Inadequate appreciation of
the extreme weakness of the test to which a substantive theory T is
subjected by merely predicting a directional statistical difference
is compounded by a failure to recognize the logical asymmetry between,
on the one hand, the (formally invalid) “confirmation” of a theory by
affirming the consequent (T ⊃ H1, H1, infer T), and on the other hand
the deductively valid refutation of the theory modus tollens by a
falsified prediction, the logical form being (T ⊃ H1, ~H1, infer ~T).

While my own philosophical predilections are somewhat Popperian, I
daresay any reader will agree that no full-fledged Popperian
philosophy of science is presupposed in what I have just said. The
destruction of a theory modus tollens is, after all, a matter of deductive
logic; whereas the “confirmation” of a theory by its making
successful predictions involves a much weaker kind of inference.
This much would be conceded by even the most anti-Popperian
“inductivist.” The writing of behavior scientists often reads as though
they assumed (what it is hard to believe anyone would explicitly
assert if challenged) that successful and unsuccessful predictions are
practically on all fours in arguing for and against a substantive theory.
Many experimental articles in the behavioral sciences, and, even more
strangely, review articles which purport to survey the current status
of a particular theory in the light of all available evidence, treat the
confirming instances and the disconfirming instances with equal
methodological respect, as if one could, so to speak, “count noses,”
so that if a theory has somewhat more confirming than disconfirming
instances, it is in pretty good shape evidentially. Since we know that this
is already grossly incorrect on purely formal grounds, it is a mistake
a fortiori when the so-called “confirming instances” have themselves a
prior probability, as argued above, somewhere in the neighborhood
of 1/2, quite apart from any theoretical considerations.

Contrast this bizarre state of affairs with the state of affairs in
physics. While there are of course a few exceptions, the usual situation
in the experimental testing of a physical theory at least involves the
prediction of a form of function or, more commonly, the prediction of a
quantitative magnitude (point-value). Improvements in experimental
precision narrow the band of tolerance about the theoretically predicted
value. What does this mean in terms of the significance-testing model?
It means: In physics, that which corresponds, in the logical structure of
statistical inference, to the old-fashioned point-null hypothesis H0 is
the value which flows as a consequence of the substantive theory T; so
that an increase in what the statistician would call “power” or
“precision” has the methodological effect of stiffening the experimental
test, of setting up a more difficult observational hurdle for the theory T
to surmount. Hence, in physics the effect of improving precision or power
is that of decreasing the prior probability of a successful experimental
outcome if the theory lacks verisimilitude, that is, precisely the reverse
of the situation obtaining in the social sciences.

As techniques of control and measurement improve or the number of
observations increases, the methodological effect in physics is that a
successful passing of the hurdle will mean a greater increment in
corroboration of the substantive theory; whereas in psychology,
comparable improvements at the experimental level result in an
empirical test which can provide only a progressively weaker
corroboration of the substantive theory.

In physics, the substantive theory predicts a point-value, and when
physicists employ “significance tests,” their mode of employment is to
compare the theoretically predicted value x0 with the observed mean
x̄0, asking whether they differ (in either direction!) by more than the
“probable error” of determination of the latter. Hence H: μ = x0
functions as a point-null hypothesis, and the prior (logical, antecedent)
probability of its being correct in the absence of theory approximates
zero. As the experimental error associated with our determination of
x̄0 shrinks, values of x̄0 consistent with x0 (and hence, compatible
with its implicans T) must lie within a narrow range. In the limit
(zero probable error = “perfect power” in the significance
test) any non-zero difference (x̄0 − x0) provides a modus tollens
refutation of T. If the theory has negligible verisimilitude, the logical
probability of its surviving such a test is negligible. Whereas in
psychology, the result of perfect power (i.e., certain detection of any
non-zero difference in the predicted direction) is to yield a prior
probability p = 1/2 of getting experimental results compatible with T,
because perfect power would mean guaranteed detection of whatever
difference exists; and a difference [quasi-] always exists, being in the
“theoretically expected direction” half the time if our substantive
theories were all of negligible verisimilitude (two-urn model).

This methodological paradox would exist for the psychologist
even if he played his own statistical game fairly. The reason for its
existence is obvious, namely, that most psychological theories,
especially in the so-called “soft” fields such as social and personality
psychology, are not quantitatively developed to the extent of being
able to generate point-predictions. In this respect, then, although this
state of affairs is surely unsatisfactory from the methodological point
of view, and stands in great need of clarification (and, hopefully, of
constructive suggestions for improving it) from logicians and
philosophers of science, one might say that it is “nobody's fault,” it
being difficult to see just how the behavior scientist could extricate
himself from this dilemma without making unrealistic attempts at the
premature construction of theories which are sufficiently quantified to
generate point-predictions for refutation.

However, there are five social forces and intellectual traditions at
work in the behavior sciences which make the research consequences 
of this situation even worse than they may have to be, considering the 
state of our knowledge. In addition to (a) failure to recognize the 
marked evidential asymmetry between confirmation and modus 
tollens refutation of theories, and (b) inadequate appreciation of the 
extreme weakness of the hurdle provided by the mere directional 
significance test, there exists among psychologists (c) a fairly wide- 
spread tendency to report experimental findings with a liberal use of 
ad hoc explanations for those that didn’t “pan out.” This last methodo- 
logical sin is especially tempting in the “soft” fields of (personality and 
social) psychology, where the profession highly rewards a kind of 
“cuteness” or “cleverness” in experimental design, such as a hitherto 
untried method for inducing a desired emotional state, or a particularly 
“subtle” gimmick for detecting its influence upon behavioral output.
The methodological price paid for this highly-valued “cuteness” is,
of course, (d) an unusual ease of escape from modus tollens refutation.
For, the logical structure of the “cute” component typically involves
use of complex and rather dubious auxiliary assumptions, which are 
required to mediate the original prediction and are therefore readily 
available as (genuinely) plausible “outs” when the prediction fails. 
It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses 
is repeated in the course of a series of related experiments, in which the 
auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc 
in order to avoid the latter's modus tollens [...]

[...] while our eager-beaver researcher, undismayed [...] and relying
blissfully on the [...] statistical hypothesis-testing, has produced a
long publication list and been promoted to a full professorship. In
[...] potent-but-sterile intellectual [...] a long train of ravished
[...]

Detailed elaboration of the intellectual vices (a)-(e) and their
scientific consequences must be left for another place, as must
constructive suggestions for how the behavior scientist can improve his
situation.
My main aim here has been to call the attention of logicians and 
philosophers of science to what, as I think, is an important and difficult 
problem for psychology, or for any science which is largely in a 
primitive stage of development such that its theories do not give rise to 
point-predictions.


testing the null hypothesis
and the strategy and 
tactics of investigating 
theoretical models 


David A. Grant¹


Testing the null hypothesis, H0, against alternatives, H1, is well
established and has a proper place in scientific research. However, this
testing procedure, when it is routinely applied to comparing
experimental outcomes with outcomes that are quantitatively predicted
from a theoretical model, can have unintended results and bizarre
implications. This paper first outlines three situations in which
testing H0 has conventionally been done by psychologists. In terms of
the intentions or strategy of the experimenter, testing H0
turns out to be an appropriate tactic in the first situation, but it is
inadequate in the second situation, and it is self-defeating with curious
implications in the last situation. Alternatives to this conventional
procedure are then presented along with the considerations which make
the alternatives preferable to testing the usual H0.


Three applications 
of H0 testing


Probably the most common application of the tactic of testing H0
arises when the independent variable has produced a sample difference
or set of differences in the magnitude of the dependent variable.
Quantitative predictions of the size of the difference or differences are
not available. The experimenter wishes to know whether or not
differences of the size obtained could have occurred by virtue of the
operation of the innumerable nonexperimental factors conventionally
designated as “random.” He [...] determines the set of hypotheses that
he is willing to entertain, H1; selects an appropriate test statistic,
t, F, χ², U, T or the like; and proceeds with the test. Rejection of H0
permits him to assert, with a precisely defined risk of being wrong,
that the obtained differences were not the product of chance variation.
Failure of the test to permit rejection of H0, unfortunately, is commonly
termed “accepting H0” [...] differences or greater ones would occur
[...] greater than α. This situation is straightforward. [...] limited
aims. He has asked a simple [...]

[...] and he devoutly hopes that the experimental and control [...]
random differences. He is now relieved or chagrined, depending upon
whether H0 is “accepted” or “rejected” as a consequence of his test.
Even if H0 is accepted his relief is tempered by some uneasiness. He
knows that he has not proved, and indeed cannot prove, that H0 is
“true.” His tactics in testing H0 seem to be oriented to the impossible
strategic aim of proving the truth of H0. Certainly, if he had a more
reasonable aim he has adopted inappropriate tactics. Utilizing these
tactics, the best he can do is to beat a strategic retreat, and if H0 is
accepted he can perhaps point out that he has used [...] test and that
if there were real differences they were most likely small. Although
psychologists have never to my knowledge [...], he might be able to go
one step further and point out that his testing procedure would reject
H0 a given percentage of the time, say, 90%, if the true difference had
been as little as, say, one-tenth of an SD. This sort of statement of
the power of a test is a commonplace in [...] (Grant, 1952, Ch. 13).
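
A power statement of this kind is easy to compute. The sketch below assumes a two-sided z test comparing two equal-sized groups with a common, known standard deviation, a simplification chosen for illustration; the function names `power_two_group` and `n_for_power` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD = NormalDist()


def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z test when the true
    difference between means is d standard deviations."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return _STD.cdf(shift - z_crit) + _STD.cdf(-shift - z_crit)


def n_for_power(d, power=0.90, alpha=0.05):
    """Subjects needed per group to reach the requested power."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    z_power = _STD.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


if __name__ == "__main__":
    n = n_for_power(0.1)
    print(n, round(power_two_group(0.1, n), 3))
```

Under these assumptions, rejecting H0 90% of the time for a true difference of one-tenth of an SD takes on the order of two thousand subjects per group, which shows how strong a claim such a power statement is.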
With the advent of more detailed mathematical models in psychology
(e.g., Bush, Abelson and Hyman, 1956; Bush and Estes, 1959; [...];
Kemeny, Snell, and Thompson, 1957) a new statistical testing situation
is arising more and more frequently. The specificity of the predictions
and perhaps the whole philosophy behind model construction pose a
different kind of statistical problem than those faced by psychological
investigators in the past. It seems obvious that as the use of models
becomes more widespread a greater number of investigators will face
the problem of evaluating the correspondence between empirical data
points and precise numerical predictions of the models. Unfortunately
most of the procedures used to date in testing the adequacy of such
theoretical predictions set rather bad examples. [...] the least adequate
of these procedures has been that in which an H0 of exact correspondence
between theoretical and empirical points is tested against H1 covering
any discrepancy between prediction and experimental results.
The models predict a considerable number of different aspects of
the data, and some of these aspects are predicted with greater success
than others (Bush and Estes, 1959, Chs. 14, 15, 17, 18). We shall
restrict our discussion to the prediction of values along a curve which
might be a learning curve. An idealized version of such a typical
situation is presented in Figure 1. Here, the dependent variable, Y, is
plotted on the vertical axis against the independent variable, X, on the
horizontal axis. The theoretical model has led to an expression,
Y' = f(X), giving a set of k theoretical predictions, Y'_1, Y'_2, ..., Y'_k.
The experiment has produced k empirical data points, a set of mean
values, Ȳ_1, Ȳ_2, ..., Ȳ_k, corresponding to the values of the independent
variable that were investigated, namely, X_1, X_2, ..., X_k. Individual
observations tend to form normal distributions about each of the Ȳ_i
and these normal distributions tend to have equal σ's for all data points.
In further discussion we shall assume that inaccuracies in the
manipulation of the independent variable, X, can be ignored. The problem
now is to judge the goodness of fit of the Y'_i to the Ȳ_i or the
correspondence between the Y'_i and the Ȳ_i.

[Figure 1: dependent variable (vertical axis) plotted against
independent variable (horizontal axis).]

Fig. 1

Idealized situation involving the test of a theoretical function, Y' = f(X).
(Theoretical points, Y'_i, are represented by open circles; obtained
means, Ȳ_i, are represented by solid circles.)

The tactics oriented toward accepting H0 as corroborating the
theory involve breaking down the deviation of the jth individual
observation from the general mean of all of the observations, as follows:

$$Y_{ij} - \bar{Y} = (Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - Y'_i) + (Y'_i - \bar{Y}) \tag{1}$$


where Y_ij is the jth observation in the ith normal distribution, and
Ȳ is the general mean of all observations. The total sum of squares may
then be partitioned as follows:

$$SS_{\text{Tot}} = SS_{\text{Dev Est}} + SS_{\text{Dev Theory}} + SS_{\text{Theory}} \tag{2}$$

where SS_Dev Est is the sum of squares associated with the variation of
individual measures from their means, SS_Dev Theory is the sum of squares
associated with the systematic departures of empirical data points from
the theoretical points, and SS_Theory is the sum of squares associated
with departures of the theoretical points from the general mean of the
whole experiment. If we suppose that the linear model for the analysis
of variance holds, then:

$$Y_{ij} = \mu + T_i + D_i + e_{ij} \tag{3}$$

where μ is the population mean for all Y_ij over the specific values
of the independent variable, X_i; [...] mean of zero and variance [...]
for all i. For a fixed set of X_i the T's and D's may be defined so that
ΣT_i = ΣD_i = 0. Under H0 each D_i = 0. Under H1 some D_i ≠ 0,
and the variance of the D_i, σ_D² ≠ 0. This last variance may be termed
the true variance of the discrepancies from the theory over the
particular set of X_i that was investigated.
The foregoing is a conventional analysis of variance model, and
the F ratio of the MS_Dev Theory divided by the MS_Dev Est provides an
excellent and powerful test of H0 against H1. The number of degrees
of freedom for SS_Dev Est will be k(n − 1) where n is the number of
observations per data point, and the degrees of freedom for
SS_Dev Theory will be k − n_f, where n_f is the number of degrees of
freedom lost in the process of fitting the model to the data. If this F is
significant, we reject H0, concluding that the discrepancies between the
Y'_i and the Ȳ_i are too great to be accounted for by the observed random
variation in the experiment. In this conclusion we accept the 5% or
1% risk implied by our choice of α.
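
The computation just described can be sketched as follows, assuming equal numbers of observations at every data point. The function name `fit_f_ratio`, its argument names, and the numbers in the usage lines are hypothetical and invented purely for illustration; looking up the tail probability of F is left to tables or a statistics library.

```python
from statistics import fmean


def fit_f_ratio(groups, predicted, n_fitted):
    """F ratio MS_Dev Theory / MS_Dev Est with its degrees of freedom.

    groups    -- k lists, each holding the n observations at one X_i
    predicted -- the k theoretical values Y'_i
    n_fitted  -- degrees of freedom lost in fitting the model (n_f)
    """
    k = len(groups)
    n = len(groups[0])
    if len(predicted) != k or any(len(g) != n for g in groups):
        raise ValueError("expected k equal-sized groups and k predictions")
    means = [fmean(g) for g in groups]
    # Variation of individual measures about their own means.
    ss_dev_est = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    # Systematic departures of the observed means from the theoretical points.
    ss_dev_theory = n * sum((m - p) ** 2 for m, p in zip(means, predicted))
    df_theory = k - n_fitted
    df_est = k * (n - 1)
    f_ratio = (ss_dev_theory / df_theory) / (ss_dev_est / df_est)
    return f_ratio, df_theory, df_est


if __name__ == "__main__":
    data = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]   # invented observations
    print(fit_f_ratio(data, predicted=[2, 3, 5], n_fitted=1))
    # F = 1.5 on (2, 6) degrees of freedom
```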
Logical difficulties arise when F fails of significance. H0 remains
tenable but is not proved to be correct. A tenable H0 provides some
support for the theory but only in the negative sense of failing to provide
evidence that the theory is faulty. To assert that accepting the H0
proves that the model provides a satisfactory fit to the data is an
inaccurate and misleading statement. We may mean that we are
satisfied, but others, especially proponents of other theories, will tend
to regard our test as too lenient.

Failure to reject H0, instead of producing closure, leaves certain
annoying ambiguities, but the tactics of testing this particular H0
imply a strategy that suffers from more serious defects that are readily
apparent when the whole conception of testing a theory is carefully
considered. To begin with, in view of our present psychological
knowledge and the degree of refinement of available theoretical
models it seems certain that even the best and most useful theories
are not perfect. This means that in terms of the analysis of variance
model, there will be some nonzero D_i's. H0, then, is never really “true.”
Its “acceptance,” rather than “proving” the theory, merely indicates
that in this instance the D's were too small to be demonstrated by the
sensitivity of the experiment in question. The tactics of accepting H0
as proof and rejecting H0 as disproof of a theory lead to the anomalous
results that a small-scale, insensitive experiment will most often be
interpreted as favoring a theory, whereas a large-scale, sensitive
experiment will usually yield results opposed to the theory!
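
This anomaly can be illustrated with a small simulation. The setup is deliberately simplified and is an assumption throughout: only k = 2 data points, a known σ, no fitted constants, and tiny fixed discrepancies D_i. With known σ the fit statistic is then chi-square with 2 degrees of freedom under H0, whose critical value has the closed form −2 ln α. The name `rejection_rate` is hypothetical.

```python
import random
from math import log, sqrt


def rejection_rate(n, discrepancies=(0.05, -0.05), sigma=1.0,
                   alpha=0.05, n_experiments=2000, seed=0):
    """Share of simulated experiments that reject the 'perfect fit' H0."""
    if len(discrepancies) != 2:
        raise ValueError("the closed-form critical value assumes k = 2")
    rng = random.Random(seed)
    critical = -2.0 * log(alpha)   # upper-alpha point of chi-square, 2 df
    se = sigma / sqrt(n)           # standard error of each observed mean
    rejections = 0
    for _ in range(n_experiments):
        # Each mean departs from its prediction by D_i plus sampling error.
        stat = sum((rng.gauss(d, se) / se) ** 2 for d in discrepancies)
        if stat > critical:
            rejections += 1
    return rejections / n_experiments


if __name__ == "__main__":
    for n in (20, 2_000, 20_000):
        print(n, rejection_rate(n))
```

The same nearly correct model is “accepted” almost every time by the small experiment and rejected almost every time by the large one.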

Curiously enough, even rejection of H0 by a [...]
stringent experimental test may be quite misleading as far as casting
light on the adequacy of the theory is concerned. If the D_i's are very
small indeed the theoretical model may be a great improvement over
anything else that is available and satisfactory for many purposes even
though an extremely sensitive experiment were to reveal the nonzero
D_i's. If our task, as scientists, were to test and accept or reject theories
as they came off some assembly line the tactics of testing H0 could be
made in a satisfactory manner simply by requiring that the test be
“sufficiently” stringent. In fact, our task and our intentions are usually
different from testing products; what we are really up to resembles
quality control rather than acceptance inspection, and statistical
procedures suitable for the latter are rarely optimal for the former
(Grant, 1952, Chs. 1, 13).


Hypothesis testing 
vs. statistical estimation


An analogy will make clear the relation between testing tactics and 
the intention of the tester. Suppose that I wish to test a parachute;
how should I go about it? How I should test depends upon my general 
intentions. If I want to sell the parachute and am testing it only to be 
able to claim that it has been tested, and I do not care what happens 
to the purchaser, then I should give the parachute a most lenient, 
nonanalytic test. If, however, I am testing the parachute to be sure of it 
for my own use, then I should subject it to a very stringent, nonanalytic 
test. But if I am in the competitive business of manufacturing and
selling parachutes, then I should subject it to a searching, analytic test,
designed to tell me as much as possible about the locus and cause of
any failure in order that I may improve my product and gain a larger
share of the parachute market. My contention is that the last situation
is the one that is most analogous to that facing the theoretical scientist.
He is not accepting or rejecting a finished theory; he is in the long-term
business of constructing better versions of the theory. Progress
depends upon improvement or providing superior alternatives, and
improvement will ordinarily depend upon knowing just how good the
model is and exactly where it seems to need alteration. The large D_i's
designate the next point of attack in the continuing project of refining
the existing model. Therefore attention should be focused upon the
various discrepancies between prediction and outcome instead of on
the over-all adequacy of the model.

In view of our long-term strategy of improving our theories, our 
statistical tactics can be greatly improved by shifting emphasis away 
from over-all hypothesis testing in the direction of statistical estimation. 
This always holds true when we are concerned with the actual size 
of one or more differences rather than simply in the existence of 
differences. For example, in the second instance of hypothesis testing 
cited at the beginning of this paper, where the investigator tests a 
pre-experimental difference, he would do better to obtain the 95% or 99% 
confidence interval for the pre-experimental difference. If the interval 
is small and includes zero, he (and any other moderately sophisticated 
person) knows immediately that he is on fairly safe ground; but if the 
interval is large, even though it includes zero, it is immediately apparent 
that the situation is more uncertain. In both instances $H_0$ would have 
been accepted. 
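
The contrast between a small and a large interval, both including zero, can be illustrated with a minimal Python sketch. All numbers are hypothetical and chosen only for illustration; the function name is an assumption, and the critical values of t are taken as given from a t table (two-sided 5% points for 98 and 8 degrees of freedom) rather than computed.

```python
import math


def difference_interval(mean_a, mean_b, pooled_var, n_per_group, t_crit):
    """Confidence interval for mean_a - mean_b with equal group sizes.

    t_crit is the tabled critical value of t for 2(n - 1) degrees of freedom.
    """
    diff = mean_a - mean_b
    # Standard error of a difference between two independent means.
    se = math.sqrt(2.0 * pooled_var / n_per_group)
    return diff - t_crit * se, diff + t_crit * se


# Hypothetical data: the same observed difference (0.5) at two levels of precision.
precise = difference_interval(50.5, 50.0, pooled_var=4.0, n_per_group=50, t_crit=1.984)
vague = difference_interval(50.5, 50.0, pooled_var=4.0, n_per_group=5, t_crit=2.306)

for label, (low, high) in (("n = 50", precise), ("n = 5", vague)):
    print(f"{label}: {low:.2f} to {high:.2f}")
```

Both intervals contain zero, so a test would accept the hypothesis of no difference in each case, but only the interval itself shows how much narrower the range of plausible differences is in the first case.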


Testing a revised $H_0$ 


Before turning to estimation procedures that are useful in examining the 
correspondence between experimental outcomes and predictions from 
a mathematical model, I shall digress briefly to outline a statistical 
testing method which can legitimately be used in appraising the fit of a 
model to data as shown in Figure 1. Basically the statistical argument 
in the proper test is reoriented so that rejection of $H_0$ constitutes 
evidence favoring the theory. The new $H_0$ is that the correlation 
between the predicted values, $Y'_i$, and the obtained values, $\bar{Y}_i$, is zero, 
after all correlation due to the fitting process has been eliminated. The 
alternative, $H_1$, against which $H_0$ is tested is that there is a correlation 
greater than zero between theoretical and empirical points. The four 
simple steps required to obtain the necessary F test are as follows: 


1. Calculate $t_i = Y'_i - \bar{Y}'$ for all i. $\sum t_i = 0$.² 

2. Calculate $SS_{\text{Correspondence}} = n(\sum t_i \bar{Y}_i)^2 / \sum t_i^2$, where n is the number 
of observations upon which each $\bar{Y}_i$ is based. Negative values of 
$\sum t_i \bar{Y}_i$ are treated as zero. 

3. Obtain $MS_{\text{Correspondence}} = SS_{\text{Correspondence}} / n_T$, where $n_T$, the number 
of degrees of freedom involved in fitting the model to the 
empirical data, will ordinarily be the number of 
fitting constants in the mathematical expression of the model. 

4. Divide $MS_{\text{Correspondence}}$ by $MS_{\text{Dev }\bar{Y}\text{s}}$ to give $F_{\text{Correspondence}}$, which has 
$n_T$ degrees of freedom for its numerator and k(n − 1) degrees of freedom 
for its denominator, k being the number of $\bar{Y}_i$. The test is one-tailed in 
the sense that negative values of $\sum t_i \bar{Y}_i$ are treated as zero values, 


2. In the unusual event where the general mean of the ae phere 
cet Dolce hesose: all A oe erent sal then be insensitive 
deviation: the mean of all the Yj. Y- ill ther 
to ee between Y’ and Y and the interpretation will r sortint 
equivocal. A separate test of Ho that Yooputation oe la pera 
here the experimenter is forced into the illicit postur 


so that the probability values of the F distribution must be halved, an 
unusual procedure with F tests in analysis of variance. 


Following the above procedure, rejection of $H_0$ now means that there 
is more than random positive covariation between predicted and 
obtained values of the dependent variable. 
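
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function and argument names are hypothetical, the error mean square is assumed to have been computed already from the individual scores, and the probability of the resulting F must still be read from an F table and halved as described.

```python
def correspondence_f(predicted, observed_means, n, n_t, ms_dev_ys):
    """F ratio for correspondence between predicted values and observed means.

    predicted      : theoretical values Y'_i, one per data point
    observed_means : empirical means, in the same order as `predicted`
    n              : number of observations behind each empirical mean
    n_t            : degrees of freedom used in fitting the model
    ms_dev_ys      : error mean square (scores about their own means)
    Returns the F ratio and its (numerator, denominator) degrees of freedom.
    """
    k = len(predicted)
    if k != len(observed_means):
        raise ValueError("predicted and observed_means must have equal length")

    # Step 1: deviations of the predicted values from their own mean.
    mean_predicted = sum(predicted) / k
    t = [p - mean_predicted for p in predicted]
    sum_t_sq = sum(ti * ti for ti in t)
    if sum_t_sq == 0:
        raise ValueError("all predicted values are equal; the test is undefined")

    # Step 2: negative covariation is treated as zero.
    cross = max(sum(ti * y for ti, y in zip(t, observed_means)), 0.0)
    ss_correspondence = n * cross ** 2 / sum_t_sq

    # Step 3: mean square for correspondence.
    ms_correspondence = ss_correspondence / n_t

    # Step 4: F ratio with n_t and k(n - 1) degrees of freedom.
    return ms_correspondence / ms_dev_ys, (n_t, k * (n - 1))


# Hypothetical example: three data points, ten observations per mean.
f_ratio, df = correspondence_f([1.0, 2.0, 3.0], [1.1, 1.9, 3.2],
                               n=10, n_t=1, ms_dev_ys=2.0)
print(f_ratio, df)
```

When the obtained means run opposite to the predictions, the cross-product is negative and the function returns an F of zero, which is the sense in which the test is one-tailed.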

This test is admirable in that it puts the burden of proof on the 
investigator, because a small-scale, insensitive experiment is unlikely 
to produce evidence favoring the model. Furthermore, if the model has 
any merit, the more sensitive the experiment, the more likely it is that a 
significant F, favoring the theory, will be obtained. Actually, the test is 
extremely sensitive to virtue in the theory, and therefore in the case of a 
moderately successful model and a moderately sensitive experiment 
both this F and the one testing the significance of systematic deviations 
from the model ($F = MS_{\text{Dev Theory}}/MS_{\text{Dev }\bar{Y}\text{s}}$) will tend to be significant. 
This outcome is no anomaly; it merely indicates that the model 
predicts some but not all of the systematic variation in the data. In 
short, progress is being made, but improvement is possible. The fact 
that simultaneous significance of both F's, indicating general success 
and specific failures of a model, should be commonplace points up 
the necessity of turning to methods of statistical estimation for a more 
adequate examination of the workings of a theoretical model. 


Practical estimation methods 
for investigation of models 


As is true of statistical tests, each method of statistical estimation has its 
advantages and limitations. In the investigation of the adequacy of 
theoretical curves in psychology there are reasons to believe that the 
simpler estimation methods have practical advantages over some of the 
more elegant procedures. To give a fairly complete view of the situation, 
methods of point and interval estimation of $\sigma_D^2$ and of the individual $D_i$ 
will be described, and a brief evaluation of each method will be given. 


Estimating $\sigma_D^2$. The variance of the discrepancies between the $\bar{Y}_i$ and 
the $Y'_i$ condenses into a single number the adequacy of fit of the theoretical 
model. As such it is an excellent index for the evaluation of the 
model. The smaller the variance, $\sigma_D^2$, the better the model, and vice 
versa. As an estimate of the size of the discrepancies one might expect in 
future similar applications of the model, $\sigma_D^2$ is far more informative 
than any F test. Furthermore $\sigma_D^2$ is readily estimated in the case of 
homogeneity of the error variance, $\sigma_e^2$. The expected values of the 
relevant mean squares are as follows: 

$\mathrm{EXP}(MS_{\text{Dev Theory}}) = \sigma_e^2 + n\sigma_D^2$ (4) 

$\mathrm{EXP}(MS_{\text{Dev }\bar{Y}\text{s}}) = \sigma_e^2$ (5) 

A maximum likelihood estimate of the variance of the discrepancies, 
$\hat{\sigma}_D^2$, is then: 

$\hat{\sigma}_D^2 = (MS_{\text{Dev Theory}} - MS_{\text{Dev }\bar{Y}\text{s}})/n$ (6) 

The accuracy of this estimator depends upon the number of degrees of 
freedom associated with $SS_{\text{Dev Theory}}$ and $SS_{\text{Dev }\bar{Y}\text{s}}$. The latter rarely 
presents a practical problem, but the former, in view of the predilection 
of investigators for minimizing the number of data points, frequently 
does. This is readily apparent when interval estimation of $\sigma_D^2$ is 
attempted. 

Bross (1950) gives a convenient method for accurate approximation 
of the fiducial interval for $\sigma_D^2$, and in this case the fiducial and confidence 
intervals for $\sigma_D^2$ are essentially equal. The method will be outlined below 
for the 95% interval. 


1. Obtain $\hat{\sigma}_D^2$ from Equation 6, above. (If the estimate is negative or 
zero, meaningful limits cannot be obtained.) 


2. Find: 

$\underline{L} = \dfrac{F / F_{.025(k-n_T,\,k[n-1])} - 1}{F \cdot F_{.025(k-n_T,\,\infty)} / F_{.025(k-n_T,\,k[n-1])} - 1}$ 

where $F = MS_{\text{Dev Theory}}/MS_{\text{Dev }\bar{Y}\text{s}}$; $F_{.025(k-n_T,\,k[n-1])}$ is the entry in 
the 2.5% F table (Pearson and Hartley, 1954) for $n_1 = k - n_T$ and 
$n_2 = k[n - 1]$; and $F_{.025(k-n_T,\,\infty)}$ is the entry for $n_1 = k - n_T$ and 
$n_2 = \infty$. 


3. Find: 

$\bar{L} = \dfrac{F \cdot F_{.025(k[n-1],\,k-n_T)} - 1}{F \cdot F_{.025(k[n-1],\,k-n_T)} / F_{.025(\infty,\,k-n_T)} - 1}$ 

where $F_{.025(k[n-1],\,k-n_T)}$ is the entry in the 2.5% F table for 
$n_1 = k[n - 1]$ and $n_2 = k - n_T$; and $F_{.025(\infty,\,k-n_T)}$ is the entry for 
$n_1 = \infty$ and $n_2 = k - n_T$. 


4. The upper and lower limits are then $\bar{L}\hat{\sigma}_D^2$ and $\underline{L}\hat{\sigma}_D^2$, respectively. 

With less than 15-20 data points these limits will be found to be 
uncomfortably wide, a fact to bear in mind when designing an experimental 
test of a theoretical model. For example, in Figure 1, with 6 data 
points and two degrees of freedom for curve fitting, the limits might 
plausibly be 0-40, whereas with 14 data points the limits might be 0-12. 
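
The point estimate of Equation 6 and the approximate limits can be sketched in Python as below. This is a sketch under assumptions rather than the author's own computation: the function and argument names are hypothetical, the four 2.5% F-table entries are supplied by the caller from published tables, the multipliers follow the usual form of the approximation for a variance component (which matches the table entries named in steps 2 and 3), and setting the lower limit to zero when F does not exceed its tabled value is an assumption made here.

```python
def discrepancy_variance_limits(ms_dev_theory, ms_dev_ys, n,
                                f_n1_n2, f_n1_inf, f_n2_n1, f_inf_n1):
    """Point estimate and approximate 95% limits for the discrepancy variance.

    With n1 = k - n_T and n2 = k(n - 1), the last four arguments are the
    2.5% F-table entries for (n1, n2), (n1, inf), (n2, n1) and (inf, n1).
    """
    sigma_hat = (ms_dev_theory - ms_dev_ys) / n  # Equation 6
    if sigma_hat <= 0:
        raise ValueError("estimate is not positive; limits are not meaningful")

    f = ms_dev_theory / ms_dev_ys

    # Lower multiplier (step 2). Assumption: report zero when F is not
    # larger than its tabled 2.5% value.
    if f <= f_n1_n2:
        lower_mult = 0.0
    else:
        lower_mult = (f / f_n1_n2 - 1.0) / (f * f_n1_inf / f_n1_n2 - 1.0)

    # Upper multiplier (step 3).
    upper_mult = (f * f_n2_n1 - 1.0) / (f * f_n2_n1 / f_inf_n1 - 1.0)

    return sigma_hat, lower_mult * sigma_hat, upper_mult * sigma_hat


# Hypothetical mean squares and made-up table entries, for illustration only.
estimate, lower, upper = discrepancy_variance_limits(
    ms_dev_theory=30.0, ms_dev_ys=10.0, n=10,
    f_n1_n2=2.5, f_n1_inf=2.0, f_n2_n1=4.0, f_inf_n1=3.0)
print(estimate, lower, upper)
```

In real use the four table entries depend on k, n and $n_T$, so adding data points changes the entries as well as the mean squares; that is how the width of the limits responds to the number of data points.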


Aside from the considerable variability in the estimate of $\sigma_D^2$, 
which can be reduced by increasing the number of data points, there 
are two other important limitations to the use of estimates of the 
variance of the discrepancies in evaluating a model. First of all, the 
population value of $\sigma_D^2$ is completely dependent upon the particular 
values of the independent variable, X, which are chosen for the test 
of the model. Choice of two different sets of X’s could well lead to 
two entirely different values of $\sigma_D^2$, and both of these values could be 
perfectly accurate. Secondly, although $\sigma_D^2$ gives an over-all index of 
the adequacy of the model being tested, it condenses so much information 
into one measure that it does not permit pin-pointing the especially 
large $D_i$'s so that they can be given proper attention in considering 
revision of the model. 


Estimating the $D_i$. The individual $D_i$ may be estimated as points, or 
intervals may be established for the $D_i$, collectively or individually. 
As before, each method has its good points and its limitations. 

Point estimation of the individual $D_i$'s consists simply in comparing 
the individual data points, the $\bar{Y}_i$, with the fitted curve. It is a crude 
method, but it has served well in the past and represents the beginning 
of wisdom. For example, in Figure 1, the model builder might well 
note that the first three data points lie below the curve and ask himself 
if there is some special reason for this. He would also note that the 
greatest discrepancy occurs at Y;, where the neighboring discrepancies 
are in the other direction. The weakness of this simple method lies in 
the absence of a criterion which will assist the investigator in deciding 
which discrepancies should be singled out for further attention and 
which may be disregarded because they are within the range of expected 
random variation. This defect is remedied by the interval estimation 
techniques. 

Probably the ideal method of interval estimation is that in which 
intervals are established for the whole curve in one operation by 
finding the 95% confidence band. The method takes the theoretical 
curve as its point of departure, and the result is a pair of curves above and 
below the theoretical curve, which will tend in the case of random variation 
to contain between them 95% of the data points. Points lying 
outside the band are immediately suspect; they are the most promising 
candidates for attention in the next version of the model. There are 
two practical difficulties with this method. First, homogeneity of the 
error variance, $\sigma_e^2$, over all the $X_i$ is required. And secondly, because 
errors in estimation of each fitting parameter must be taken into 
account, for all but the simplest curves (Cornell, 1956, pp. 184-186) 
the bands may be difficult³ to obtain. Although the method is elegant, 
in practice it will rarely represent sufficient improvement over the final 
method, given below, to justify its use. 

The last method seems to me to be the most useful and most robust 
and most flexible method. It can be widely applied, and the relative 
ease of application, coupled with its ability to discriminate between 
significant and random discrepancies, makes it superior to the other 
estimation methods. It also possesses the homely virtue of being 
readily understood. In contrast to the preceding method, this one takes 
as its point of departure the empirical means, and consists simply in 
computing the 95% confidence limits for each of the means. With 
homogeneity of variance, the error variance of each mean is estimated 
simply as $\hat{\sigma}_e^2/n$; in cases of suspected heterogeneity, each mean can 
have its own estimate of error variance. This will, of course, be the 
variance of the distribution of $Y_{ij}$ for each i, divided by n. Once the 
limits have been obtained, attention is directed to points where the 
theoretical curve lies outside the limits. In some cases the investigator 
might choose to establish the 80% or 90% limits in order to direct 
attention to less drastic departures of the experimental data from the 
model. Choice of an optimum level for the limits is hard to make on 
a general a priori basis, but it is likely that limits narrower than the 
traditional 95% will be found more useful than wider ones. 
Simple as this method is, it is hard to improve upon. Instead 
of giving an almost meaningless over-all evaluation 
of a model, it directs attention to specific defects; its functioning 
improves as the precision of the experimental test is improved, and the 
investigator can set the confidence coefficient to obtain the desired 
sensitivity to defect at a cost of a fairly well-specified risk of 


3 rameter must be 
A sufficient estimate of the error variance e parameters or else 
yailable and independent of the enna $ E found and the theoretical 
T covariances of all parametric annia T ivatives with respect to the 
function must have continuous first Es a be found in the asymptotic 
Parameters in order that the confidence lait ae is involved in the fitting 
fase (Rao, 1952, pp. 207-208). Where an estimators can rarely 
Of the theoretical function, satisfactory indep 

be obtained. 


false positives or wild goose chases. A final and often crucial advantage 
is that the confidence intervals, based as they are upon the experimental 
means, can be obtained in cases where the form of the theoretical 
function does not permit satisfactory estimation of its parameters, 
and the analysis of variance and confidence bands methods cannot 
properly be applied. 
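
The method of confidence limits about the empirical means can be sketched in Python as follows. The sketch assumes homogeneity of variance, so that one error variance serves every mean; the function name, the data and the critical value of t (taken from a table for the error degrees of freedom) are all hypothetical.

```python
import math


def flag_departures(means, predicted, error_var, n, t_crit):
    """Confidence limits for each empirical mean, plus the indices of the
    points at which the theoretical value lies outside those limits.

    error_var : pooled estimate of the error variance (homogeneity assumed)
    n         : number of observations behind each mean
    t_crit    : tabled critical t for the chosen confidence coefficient
    """
    half_width = t_crit * math.sqrt(error_var / n)
    limits = [(m - half_width, m + half_width) for m in means]
    flagged = [i for i, ((low, high), p) in enumerate(zip(limits, predicted))
               if p < low or p > high]
    return limits, flagged


# Hypothetical means and predictions at three values of X.
limits, flagged = flag_departures(means=[10.0, 12.0, 15.0],
                                  predicted=[10.5, 14.0, 15.2],
                                  error_var=4.0, n=16, t_crit=2.0)
print(limits)
print(flagged)
```

Passing a smaller critical value (for example the 80% or 90% point) narrows the limits and flags less drastic departures, at the price of more false alarms. Under suspected heterogeneity the same idea applies with a separate variance estimate for each mean.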


Summary and conclusions 


In this paper I have attempted to show that the traditional procedure 
of testing a null hypothesis ($H_0$) of a zero difference or set of zero 
differences is quite appropriate to the experimenter’s intentions or 
scientific strategy when he is unable to predict differences of a specified 
size. When theory or other circumstances permit the prediction of 
differences of specified size, using these predictions as the values in $H_0$ 
is tactically inappropriate, frustrating and self-defeating. This is 
particularly true when a theoretical curve has been predicted, and $H_0$ is 
framed in terms of zero discrepancies from the curve. If rejection of $H_0$ 
is interpreted as evidence against the theory, and “acceptance” of $H_0$ 
is interpreted as evidence favoring the theory, we find that the larger 


intentions are to detect and correct defects, if possible, so that he can 
more general theoretical model. Because 
ove or disprove a theory but rather to seek 
priate statistical tactics should be those 
than hypothesis testing. 

ative techniques available for point of 
Tepancies between theoretical predictions 
or the over-all variance of these discrepan- 
timation of the confidence intervals for the 
tical curve is the most practical and most 
procedure. Other writers have recently 
emphasized the values of various estimation as opposed to hypothesis 
testing techniques (e.g., Bolles and Messick, 1958; Gaito, 1958; 
Savage, 1957) and it is hoped that considerations pointed out by them 
and points raised in this paper will be helpful to investigators who are 
in the process of examining theoretical models which lead to specific 
numerical predictions of experimental outcomes. 


further considerations on 
testing the null hypothesis 
and the strategy and 
tactics of investigating 
theoretical models 


Arnold Binder 


The arguments in a recent article by Grant (1962) are directed against 
experimental designs oriented toward acceptance of the null hypothesis, 
that is, where support for an empirical hypothesis depends upon 
acceptance of the null hypothesis. Atkinson and Suppes (1958) 
provide an excellent example of the type of experimental logic to which 
Grant objects. These investigators postulated a one-stage Markov 
model for a zero-sum, two-person game. On the basis of the model 
they predicted, first, the mean proportion of various responses over 
asymptotic trials and, second, that the probability of State k given 
States i and j on the two previous trials is equal to the probability 
of State k given only State j on the immediately preceding trial (i.e., 
that a one-stage Markov model accounts for the data). The predictions 
were then compared with the obtained results by means of a series of 
t tests, in the former case, and a χ² test, in the latter. One of the t tests, 
for example, involved a comparison of the predicted proportion of .60 
against the observed mean proportion of .605, while another a comparison 
of a predicted value of .667 and an observed value of .670. 


From Psychological Review, Vol. 70 (No. 1), 1963, pp. 107-115. Copyright 1963 
by the American Psychological Association. Reproduced by permission. 

Support for the one-stage Markov model was then inferred by the 
failure of the t tests and the χ² to reach the .05 level of significance. 
That is, support for the empirical model came from acceptance of the 
null hypotheses. Other examples may be found in Binder and Feldman, 
1960; Bower, 1962; Brody, 1958; Bush and Mosteller, 1955; Grant 
and Norris, 1946; Harrow and Friedman, 1958; Weinstock, 1958; 
and Witte, 1959. 
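
The second prediction, that the state on a trial depends only on the immediately preceding state, can be illustrated with a small Python sketch. It is not a reconstruction of the analysis by Atkinson and Suppes: the function name and data are hypothetical, and the statistic is an ordinary Pearson χ² of independence between the state two trials back and the current state, computed separately within each preceding state and summed.

```python
from collections import Counter


def markov_order_chi_square(sequence):
    """Pearson chi-square for departure from the one-stage Markov property.

    For each middle state j, the states two trials back (i) and the current
    states (k) are cross-classified; under the one-stage model i and k are
    independent given j. The contributions are summed over all j.
    """
    triples = Counter(zip(sequence, sequence[1:], sequence[2:]))
    chi_square = 0.0
    for j in sorted(set(sequence)):
        rows, cols, total = Counter(), Counter(), 0
        for (i, mid, k), count in triples.items():
            if mid == j:
                rows[i] += count
                cols[k] += count
                total += count
        for i in rows:
            for k in cols:
                expected = rows[i] * cols[k] / total
                observed = triples.get((i, j, k), 0)
                chi_square += (observed - expected) ** 2 / expected
    return chi_square


# Hypothetical sequences of states.
first_order = "AB" * 10      # next state fixed by the preceding state alone
second_order = "AABB" * 5    # next state depends on the two preceding states
print(markov_order_chi_square(first_order))
print(markov_order_chi_square(second_order))
```

In the acceptance-support logic described above, a small value of the statistic (one failing to reach the .05 level for the appropriate degrees of freedom) is what is taken as support for the one-stage model.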

To facilitate future discussion it is convenient to refer to the 
procedure where acceptance of the null hypothesis leads to support 
for an empirical hypothesis as acceptance-support (a-s), and to the 
procedure where empirical support comes from rejection of the null 
hypothesis as rejection-support (r-s). 

In addition to the objections to a-s, Grant argues that the method of 
testing statistical hypotheses may not be a very good idea in any case. 
He thus argues it is wise to shift away from the current emphasis in 
psychological research on hypothesis testing in the direction of 
statistical estimation. 


Statistical logic 


There have been two principal schools of thought in regard to the 
logical and procedural ramifications of statistical inference. The older 
of these stems from the writings of Yule, Karl Pearson, and Fisher, 
while the other comes from the early work of Neyman and Pearson 
and the more recent developments of Wald. The respective influence 
of each of these schools on experimental statistics is abundantly 
evident, but a difficulty in separating these influences is that the 
recommendations for tests and interval estimates in a field like psychology 
are similar for both. 

In the Fisher school one starts the testing process with a hypothesis, 
called the “null hypothesis,” which states that the sample at hand is drawn 
from a hypothetical population with a sampling distribution in a 
certain known class. Using this distribution one rejects the null 
hypothesis whenever the discrepancy between the statistic and the 
relevant parameter of the distribution of interest is so large that the 
probability of obtaining that discrepancy or a larger one is less than 
the quantity designated α (the significance level). No statement 
is provided for the manner in which the null hypothesis is chosen, but 
the tests with which Fisher (1949) has been associated are in the form 
where the null hypothesis is equated with the statement “the phenomenon 
to be demonstrated is in fact absent.” 

The concept “rejection of the null hypothesis” is therefore unambiguous 
in the context of Fisher's viewpoint. But what about 
“acceptance of the null hypothesis?” Fisher (1949) provides the 
following statement “the null hypothesis is never proved or established, 
but is possibly disproved, in the course of experimentation. Every 
experiment may be said to exist only in order to give the facts a chance 
of disproving the null hypothesis” (p. 16). This is not very edifying 
since one does not expect to prove any hypothesis by the methods of 
probabilistic inference. Hogben (1957) has interpreted these and similar 
statements of the Yule-Fisher group to mean that a test of significance 
can lead to one of two decisions: the null hypothesis is rejected at the α 
level or judgment is reserved in the absence of sufficient basis for 
rejecting the null hypothesis. 

Papers by Neyman and Pearson (1928a, 1928b) pointed out that 
the choice of a statistical test must involve consideration of alternative 
hypotheses as well as the hypothesis of central concern. They intro- 
duced the distinction between the error of falsely rejecting the null 
hypothesis and the error of falsely accepting it (rejecting its alternative). 
Neyman and Pearson’s (1933) general theory of hypothesis testing, 
based on the concepts Type I error, Type II error, power, and critical 
region, was presented later. 

The possible parameters for the distribution of the random variable 
or variables in a given investigation are conceptually represented by a 
set of points in what is called a parameter space. This space is considered 
to be divided into two or more subsets, but we shall restrict our 
present discussion to the classical case in which there are exactly 
two subsets of points. 

The statistical hypothesis specifies that the parameter point lies 
in a particular one of these two subsets while the alternative hypothesis 
specifies the other subset for the point. A statistical test is a procedure 
for deciding, on the basis of a set of observations, whether to accept or 
reject the hypothesis. Acceptance of the hypothesis is precisely the 
same as deciding that the parameter point lies in the set encompassed 
by the hypothesis, while rejection of the hypothesis is deciding that 
the point lies in the other subset. A typical test procedure assigns to 
each possible value of the random variable one of the two 
possible decisions. 


Sets of distributions (or their associated parameters), in this 
mathematical model, may be considered to correspond to the explanations 
in the empirical world which may account for the possible 
outcomes of a given experiment. Empirical hypotheses, which specify 
values or relationships in the scientific world, are translatable on this 
basis into statistical hypotheses. But the distinction between empirical 
and statistical hypotheses is quite important: the former refer to 
scientific results and relationships, the latter to subsets of points in a 
parameter space; they are related by a set of correspondences between 
scientific events and parameter sets. 


The term “null hypothesis” does not occur in the writings of many 
of the advocates of the Neyman-Pearson view. Except for one pejorative 
footnote I was unable to find the term used by Neyman (1942), 
for example, in any of an extensive array of his publications. In 
general, these people prefer the term “statistical hypothesis” or simply 
“hypothesis” in designating the subset of central concern and “alternative 
hypothesis” for the other subset. However, null hypothesis has taken 
on meaning over the years in the context of the Neyman-Pearson 
tradition among many writers of statistics, particularly those with 
expository proclivities. In the Dictionary of Statistical Terms (Kendall 
and Buckland, 1957) we find the following definition for null hypothesis: 

“In general, this term relates to a particular hypothesis under test, as 
distinct from the alternative hypotheses which are under consideration. 
It is therefore the hypothesis which determines the Type I Error” 
(p. 202). 


An evaluation¹ 


Grant’s position in regard to a-s is certainly not new or novel since it has 
been implicit in the writings of Fisher for the past 25 years. Moreover, 


1. There is a third viewpoint, represented in the psychological literature by 
Rozeboom’s (1960) recent article, from which Grant’s position could be 
evaluated. This viewpoint emphasizes the importance of the 
probabilities of alternative explanations, in the Bayes sense, rather than the 
decision aspects of experimentation. However, the philosophical and 
practical problems of this approach remain enormous as is evident from the 
debates on this and related topics over the years. See, a Schad Meee 
(1957), Neyman (1952), Hogben (1957), Savage (1954), Che! 


i i ly Parzen (1960) who discusses 
(1959), von Mises (1942, 1957), and partic ila ity in applied problems. It is 


the dangers of using Bayesian inverse p! knon 
typically not the case in basic research that one canaine A AEn 
parameter is a random variable with some speci i vide an adequate 
and in such cases this approach does not presently pro i y 


hypothesis evaluation. d 
pas the pe erent philosophical persuasion than the 
De Sh eae ) is equally unsympathetic with the inferen- 


r i oom (1960) l 1 
Paar Grant. He cuts into an essential component of this 
bias in the following succinct and effective eia A a 

Although many persons would like to conceive NHD [the null hypothesis 
decision procedure] testing to authorize only rejection of the hypothesis, not, 
in addition, its acceptance when the test statistic fails to fall in the rejection 
region, if failure to reject were not taken as grounds for acceptance, then 
NHD procedure would involve no Type II error, and no justification would 
be given for taking the rejection region at the extremes of the distribution, 
rather than in its middle (p. 419). 

it has been part of the folklore of statistical advising in psychology 
at least since my initial exposure to psychological statistics 
(see Footnote 2). And, in fact, if Grant wishes to argue that his position 
holds only in the very narrowest interpretation of the Karl 
Pearson-Fisher structure, I see no grounds for contesting it. If there 
are only two possible decisions (reject the null hypothesis or reserve 
judgment), one would surely not wish to equate the null hypothesis 
with the empirical hypothesis designating a specific value. Using such 
logic an investigator could just as well reject as retain a theory which 
led to perfect predictions over a wide range. 

In this context I would like to point out that there are many 
logical difficulties connected with the Fisher formulations which have 
been brought out dramatically in years of debate (Fisher, 1935, 
1955, 1959, 1960; Neyman, 1942, 1952, 1956, 1961). Moreover there 
are some people who, while generally sympathetic with the Fisher 
viewpoint, are quite willing to accept the null hypothesis and conclude 
that this provides support for an empirical hypothesis (Mather, 1943; 
Snedecor, 1956). 

In the pursuit of evaluating Grant’s position from the Neyman-Pearson 
theory we must remember that the null hypothesis is a 
statistical hypothesis which designates a particular subset of parameter 
points. Moreover, the null hypothesis and the alternative hypothesis 
(the other subset) are mutually exhaustive so that rejection of the one 
implies acceptance of the other; acceptance of a hypothesis being the 
belief, at a certain probability level, that the subset specified by the 
hypothesis includes the parameter point. There can be no question 
about the legitimacy or acceptability of acceptance of the null hypothesis 
within this purely mathematical scheme since acceptance and 
rejection are perfectly complementary. 

Consequently any interpretive difficulties which result from 
accepting the null hypothesis must be in the rules for or manner of 
relating empirical and statistical (null) hypotheses. The null hypothesis 
is of course that hypothesis for which the probability of erroneous 
rejection is fixed at α (or set at a maximum of α); the test (critical 
region) is chosen so as to maximize power for the given α and 
alternative hypothesis. Since therein lies the only feature of the process 
that differentiates the null hypothesis from its alternative, it must 
govern the correspondence of empirical and statistical hypotheses. 
While there are no firm rules, it is conventional to choose as the 
null hypothesis that empirical hypothesis for which the error of 
erroneous rejection is more serious than the error of erroneous 
acceptance, so that the more important error is under the direct control 
of the experimenter. There are a few other conventions based upon the 
advantages of fixing α for a simple (rather than a composite) 
hypothesis. It is quite clear that Grant has not merely restated 
these conventions. In fact, Grant's (1962) strong statement that “using these 
predictions as the values in $H_0$ [the null hypothesis] is tactically 
inappropriate, frustrating, and self-defeating,” (p. 61) indicates that 
his position is much more than a convention of convenience. 
The argument which I will develop over the remainder of this paper 
is not that a-s is preferable to r-s, but that there are no sound foundations 
for damning a-s. In this process let me initially point out that one 
can be led astray unless he recognizes that when one tests a point 
hypothesis he usually knows before the first sample element is drawn 
that his empirical hypothesis is not precisely true. Consider testing the 
hypothesis that two groups differ in means by some specified amount. 
We might test the hypothesis that the difference in means is 0, or perhaps 
12, or perhaps even 122.5. But in each case we are certain that the 
difference is not precisely 0.0000 . . . ad inf., or 12.0000 . . ., or 
122.50000 . . . ad inf. 
Recognition of this state of affairs leads to thinking in terms of 
differences or deviations that are or are not of importance for a given 
stage of theory construction or of application. Some express this in 
terms of differences which do and do not have practical importance, 
but I prefer the term zone of indifference which is used with important 
implications in sequential analysis. That is, if, for example, the 

difference in mean performance be is less than, say, 

ah two means may be consi 
retical development. In t 

s in a maze, one would expect 


f : š 
or the proportion of right turns of rats it 
Same courses of action to be followed if the figure were 


or .335. Thus, although we may specify a point null hypothesis for the 
purpose of our statistical test, we do recognize a more or less broad 
indifference zone about the null hypothesis consisting of values which 
are essentially equivalent to the null hypothesis for our present theory 
or practice. 

While the formal procedures for testing statistical hypotheses are 
based upon the assumption that the sample size (n) is fixed prior to 
consideration of alternative test procedures, the user of statistical 
techniques is faced with the problem of choosing n and does so with 
regard for the magnitude of the discriminations which are or are not 
important for his particular application or level of theory development. 
In the typical case we choose the conditions of experimentation, 
including sample size, such that we will reject the null hypothesis 
with a given probability when the parameter difference is a certain 
magnitude. This is frequently done very formally in fields like agriculture, 
although rather informally in psychology. For example, in 
Cochran and Cox (1957) there is an extended discussion of the procedures 
for choosing the number of replications for an experiment on the 
basis of the practical importance of true differences. Thus in one of 
their examples, a difference of 20% of the mean of two values is 
considered sufficiently important to warrant a sensitive enough 
experiment to have an .80 probability of detecting it; that is, if the 
difference is 20% a large enough n is desired to insure that the power 
of the test is .80. Although it may happen that the required sample size 
is a function of an unknown distribution and not determinable in 
advance, it can usually be approximated with the tests used most 
frequently by psychologists. 
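
A rough version of such a calculation can be sketched in Python. This is a normal-approximation sketch under stated assumptions, not the tabled procedure of Cochran and Cox: two groups of equal size, a known common standard deviation, a two-sided test, and hypothetical function names and numbers.

```python
import math
from statistics import NormalDist


def replications_needed(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a true difference `delta`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) * sigma / delta) ** 2)


def approximate_power(delta, sigma, n, alpha=0.05):
    """Approximate probability of rejecting a zero-difference hypothesis
    when the true difference is `delta` and each group has n observations."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    standard_error = sigma * math.sqrt(2.0 / n)
    return z.cdf(abs(delta) / standard_error - z_alpha)


# Hypothetical case: a difference of half a standard deviation.
n_required = replications_needed(delta=0.5, sigma=1.0)
print(n_required, approximate_power(0.5, 1.0, n_required))
```

The same power function shows the other side of the choice of n: with a very large sample, a difference as small as one hundredth of a standard deviation is almost certain to be declared significant.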
Choice of sample size is but one feature in the overall planning
for the experiment of the desired precision with due consideration
for the level of theory development (including alternate theories), the
[...] and the related consequences of decision. However, such
features as the standard error per unit observation and [...] do not
have the flexibility of sample size, and, [...] chosen to maximize
precision for reasons of economy. The choice of optimum sample size
applies to all experimental strategies, including the nonobjectionable
(to Grant) and more [...] r-s. [...] anyone who wants to obtain a
significant difference [...] can obtain one, if his only consideration
is obtaining that significant difference. Accepting that two groups are
never perfectly equal, the difference between them is some value ε. It
is obviously an easy matter to choose a sample large enough, for the ε,
such that we will reject the null hypothesis [...]. The difference may
be so slight as to have no practical or theoretical consequences for the
given stage of measurement and theory construction. As McNemar (1960)
has recently pointed out, in his objections to the use of extreme
groups, significant differences may be obtained even when the underlying
correlation is as low as .10, which implies a proportion of predicted
variance equal to .01.
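The arithmetic behind this point can be sketched briefly. The example
below treats the simple case of an ordinary correlation coefficient, not
the extreme-groups design McNemar discusses, and it uses the normal
critical value in place of Student's t, which is an approximation that
is reasonable only for large samples. The function names are
hypothetical.

```python
from math import ceil
from statistics import NormalDist


def proportion_of_variance(r):
    """Proportion of variance predicted by a correlation r."""
    return r * r


def sample_size_for_significance(r, alpha=0.05):
    """Smallest n at which a sample correlation of size r is significant.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) with the two-tailed normal
    critical value standing in for t (a large-sample assumption).
    """
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(2 + z_critical ** 2 * (1 - r * r) / (r * r))


print(proportion_of_variance(0.10))
print(sample_size_for_significance(0.10))
```

A correlation that predicts about one percent of the variance thus
becomes "significant" once a few hundred observations are collected.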


After arguing against a-s on the basis of the dangers of tests that
tend toward leniency, Grant points out that the procedure may be
equally objectionable when the test is too stringent. He illustrates the
latter by an example of a theory which is useful though far from perfect
in its predictions. This particular point is perfectly in accord with my
arguments since it demonstrates the parallelism of a-s [...] it may be
undesirable to reject a useful, though [...] theory [...]. [...] want an
experiment [...]: in a-s because one may [...] theory; in r-s because
it may not be desirable [...] though inaccurate theory (that is, to
accept the null [...] rejection of its alternative). The identical
terms were used in the preceding sentences to dramatize the parallel
implications for a-s and r-s of the general desirability of a test that
is neither too precise nor too insensitive. Whether or not the
experiment is precise enough is, then, a function of theoretical and
practical considerations and not of whether acceptance or rejection of
the null hypothesis leads to support for an empirical theory.
But, one may argue, while there is logical equivalence as stated
above, there is not motivational equivalence. That is, while it is
agreed that ideally investigators design their experiments (including
their choice of sample sizes) in order to be reasonably certain of
detecting only differences which are of practical or theoretical
importance, in actual practice they are neither so wise nor so pure as
to be influenced by these factors to the exclusion of social
motivations. And it is indeed easier to do insensitive rather than
precise experimentation.

This phenomenon is of course what Grant (1962) referred to in the
statement, "The tactics of accepting H0 as proof and rejecting H0 as
disproof of a theory lead to the anomalous results that a small-scale,
insensitive experiment will most often be interpreted as favoring a
theory, whereas a large-scale, sensitive experiment will usually yield
results opposed to the theory!"

Perhaps this sentence reflects the essential point of Grant's
presentation: a warning to imprudent experimenters that the combination
of personal desire to establish one's hypothesis and the ease of
performing insensitive experimentation produce a particularly
troublesome interaction.

Before proceeding it should be remembered that scientific
considerations may be made secondary to personal desires to establish a
theory whether the procedure be a-s or r-s in a perfectly analogous
fashion. The only difference involves such practical considerations as
the fact that it is usually easier to run 5 or 10 subjects than 100 or
500.

If Grant (1962) merely intended his article to convey this obvious
warning, I cannot understand the discussions which involve such
statements as the following:


Unfortunately most of the procedures used to date in testing the
adequacy of such theoretical predictions [from mathematical models] set
extremely bad examples. Probably the least adequate of these procedures
has been that in which an H0 of exact correspondence between theoretical
and empirical points is tested against H1 covering any discrepancy
between predictions and experimental results (p. 55).


If one is pointing out the dangers of using insensitive a-s tests
(rather than condemning a-s on logical grounds), one would be expected
to [...]. As I see [...] nothing but a particular form of bad
experimentation. It is unquestionably the case that an a-s experiment
that is too small [...].


Grant's position from the viewpoint of scientific development


In the process of concluding this discussion I [...] emphasize and
expand on certain factors which seem most [...] of evaluating scientific
theories, as well as to indicate why objections to Grant's [...] are
justifiable beyond the confines of the Neyman-Pearson [...].

It is clear that at various phases in the development of a discipline
[...] is faced with the problem of deciding about the suitability of
[...] theories. When a discipline is at an early stage of development,
knowledge of empirical relationships is crude so that [...] explanatory
constructs may be the most that is obtainable. Then, at this stage one
might consider as a significant accomplishment the ruling out of the
hypothesis that observed differences are chance phenomena. The empirical
hypothesis of central concern would be that there is some relationship
of unknown magnitude, while its alternative would be the chance or noise
explanation.


With increasing sophistication in the discipline the alternative
theories may represent different, but more or less equally
well-developed theories. One does not choose between theory and chance,
but between theory and theory or between theory and theories.
Another aspect of increased sophistication is frequently the greater
precision in the prediction of empirical results for the various
theories. The decision as to which of the theories is admissible on the
basis of available data may be accomplished directly within the
Neyman-Pearson framework, but that is not necessarily the case.
Sometimes the choice among theories depends upon a succession of tests
of hypotheses or possibly even upon quite informal considerations; as
an example of the latter, one theory may lead to a prediction which is
[...] in accord (within rounding errors) with the observations while the
other theory is off by quite a margin; a statistical test would be
foolish indeed. In disciplines that have markedly lower observational
variability than psychology the most common procedure consists of a
subjective comparison between predictions and observations. Moreover,
the point that one chooses among alternative hypotheses at various
stages of scientific development (whether by statistical methods or
otherwise) most certainly does not imply that his efforts stop once he
has accepted or rejected a given hypothesis as Grant implies; if the
accepted theory, for example, is of any interest he proceeds to make
finer analyses and comparisons which may range from orthogonal
subcomparisons in the analysis of variance to intuitive rumination.
This [...] to Grant's arguments to the effect that hypothesis testing
[...] supplemented) by estimation. The point is that [...] at different
phases of investigation.
I will again refer to the Atkinson and Suppes (1958) experiment to
illustrate the relative roles of hypothesis testing and subsequent
analysis in scientific advancement. Their first strategy was to decide
which of two theories, game theory or the Markov model, was most
adequate in the given experimental context. This clearly was a 
problem of testing hypotheses; a choice had to be made and the 
procedures of estimation could at best provide a substage on the way 
to the decision. The Markov model was accepted and game theory 
rejected, as noted above, but this certainly did not lead to a cessation 
of activity. Instead the investigators initially compared theoretical and 
observed transition matrices (and found them distinctly different), 
they then tested the more specific hypothesis of a one-stage Markov 
model against the alternative of a two-stage model, and finally they 
investigated the stationarity of the Markov process. 

During its early phases, Einstein’s general theory of relativity was 
equivalent to Newtonian theory in the success of explaining various 
common phenomena and a choice between them could not be made. 
But the Einstein theory led to certain predictions differing from 
Newtonian and these in turn led to a series of “crucial” tests. Among 
these were the exact predictions as to the magnitude of the bending 
of a light ray from a star by the gravitational field of the sun and
the shift of wavelength of light emitted from atoms at the surface of
stars. The general theory of relativity, thus, led to predictions which
differed from the predictions of the alternative theory (Newton's), and
the ultimate correspondence between these predictions and empirical
observations (acceptance of no difference between predicted and
obtained results) led to support for general relativity. While
agreements between theory and observational results have been close
they certainly have not been perfect; even physicists have problems of
measurement precision and intricacy of mathematical derivation. But to
the best judgment of the scientists the closeness of the fit between
predictions and observations warrants the conclusion that the data
provide support for the theory. Surely, however, despite its tremendous
power, physicists do not claim that Einstein's general theory has been
proved nor are they convinced that it will not be ultimately replaced
by a better theory.

It does not seem reasonable to argue that this method of scientific
procedure is not suitable for psychology just because our measurement
precision happens to be lower than in physics and we use
statistical tests rather than purely observational comparison.
a note on the
inconclusiveness of 
accepting the null 
hypothesis 


Warner R. Wilson 
Howard Miller 


Grant (1962) and Binder (1963) have recently discussed the utility of
basing support for a theory on the acceptance of the null hypothesis
(H0) versus basing support on the rejection of H0. [...] refer to the
procedure where acceptance of H0 leads to support for a theory as
acceptance-support (a-s), and to the procedure where support [...] as
rejection-support (r-s).

Both writers have contributed [...] the clarification of the issue, but
ambiguity and disagreement still remain.

Before discussing the problem it may be desirable to discuss some of
the terms to be used. This paper follows the lead of Grant (1962) in
viewing [...] the supposition that no difference or no relation [...].
Seemingly the term null hypothesis is often used in this way, and
certainly Grant (1962) used it [...] first stated the issue. [...]
therefore, in this sense, [...] to whose rejection alone the
probability estimate [...] the Yule-Fisher school [...] Pearson [...]
either reject the H0 or accept it.¹ The [...] would seem more compatible
with r-s since in r-s one either rejects H0 or withholds judgment.
Likewise, the [...] seems more compatible with a-s since it allows
acceptance of the H0, and in a-s it is the acceptance of the H0 that
provides a reward [...] theory which predicted no difference in the
first place. [...] any reasonable person would concede that if the
theory predicted no difference, the failure to find a significant
difference gives some comfort to the theory, at least relatively
speaking. It would seem, therefore, that it makes little difference
which of these views one takes.

This paper suggests that the issue of a-s versus r-s is [...]
broader than Grant and Binder indicate. It is hoped that the [...]
perspectives presented here will add further clarification and
provide grounds for the resolution of the issue. Grant and Binder
discuss primarily a strategy decision which must be made in relation
to the analysis of data. This paper points out that a similar decision
must be made in relation to the collection of data.


Grant (1962) mak 
Psychology at least, t 
accuracy of his theor 
between theoretical a 


d, this 


1. Binder's (1963) article provides a more detailed and well-documented
discussion of these two viewpoints.

result usually indicates [...] that the experiment was not sufficiently
sensitive to [...] the imperfections in the theory that are almost
certainly there. Further, the more insensitive the experiment, the more
likely it is to "support" a theory even if the theory is poor. When one
uses r-s, however, insensitivity works against acceptance of a theory.
[...] it is also difficult to give a satisfactory interpretation to a
[...] a-s because, if an experiment is very sensitive, it may reject
a theory which is really fairly good. When an experimenter seeks r-s,
high sensitivity increases his chances of obtaining support.

Binder (1963, cf. p. 111) claims that parallel objections can be made
to r-s-a: r-s-a may fail to support a good theory if the experiment
is insensitive and, at the same time, if the experiment is very
sensitive, an r-s-a may lend support to a worthless theory, that is,
one that has negligible predictive utility. Grant's arguments, says
Binder, do not add up to an objection against an a-s strategy; they add
up to an objection against experiments that are either too insensitive
or too sensitive. Experiments whose sensitivity is appropriate to the
[...] of theory construction and application will enable an
experimenter to accept helpful theories and reject useless ones no
matter which type of analysis he applies.

While this last point may be sound enough logically, it does not
[...] practically, since it would seem to be rather difficult to
establish any workable formula for deciding what is too precise or
too imprecise in a given setting. Better to have a strategy that
protects against the more serious type of error even if optimal
sensitivity is not maintained.
Binder (1963) does make a contribution, however, in that he brings
to our attention that we can make several different kinds of mistakes.
We can (a) accept a false theory; (b) withhold judgment about or reject
a true theory; or (c) accept a "true but poor" theory (as discussed by
Binder, 1963, pp. 111-112). It may be helpful to note the relation
between choice of a-s-a or r-s-a, use of sensitive or insensitive
experiments, and the type of error one is subject to:

    In an imprecise experiment a-s-a leads to Error Type a
    In a precise experiment a-s-a leads to Error Type b
    In an imprecise experiment r-s-a leads to Error Type b
    In a precise experiment r-s-a leads to Error Type c

The choice between a-s-a and r-s-a would seem to hinge on an
evaluation of the seriousness of these different mistakes.
Traditionally, Type a is considered much more serious than Type b.
Campbell (1959) has presented a very sophisticated justification of
this time-honored assumption. Ordinarily experimenters pay no attention
to
Type c; most experimenters are content to base r-s on a significant 
t or F and to worry not at all if the effect is so small as to be negligible. 
It is apparently quite rare in psychology for an experimenter to insist 
that a difference be larger than some minimum before it is accepted as 
important. And, many psychological theories are able to predict 
only that differences, of unspecified magnitude, should or should not 
occur. Some even argue that small effects may be very important from 
the viewpoint of theory development and maintain, in effect, that 
when one reports a significant but very small effect he is not making an 
error at all but being highly virtuous. Nonetheless, it may be unwise to 
give so little heed to the size of effects, that is, to Error Type c. It
could be argued that psychologists may be spending a great deal of their
time studying variables better viewed as irrelevant. This would seem to
be more a fault of interpretation of a detected relationship than its
detection itself. Finding a relationship is still a virtue while
misusing it may not be.
One could even argue that Error Type a, acceptance of a false
theory, is much the same as Error Type c, acceptance of a theory that is
true as far as it goes but which is able to predict only a negligible part 
of the variance in the data. Indeed, Binder presents much of his defense 
of a-s on the fact that while a-s may lead to a, r-s has a corresponding 
weakness in that it may lead to c. Even if it is granted that Error
Types a and c are similar in their consequences (and this is a
concession that [...]) [...] psychology as a whole [...] -s analysis.
For one thing, human nature [...] experimenters will probably be
tempted to do [...] precise experiments. One might argue [...]
worthless.

These various considerations would seem to support Grant's
contention that the usual acceptance-rejection tactics might well be
replaced or at least supplemented by a quality control procedure 
which would indicate not merely if theories predict but how well. 
The above considerations would not, however, seem to support 
Binder's contention that a-s and r-s expose the experimenter to
completely parallel and equally serious pitfalls. Indeed, in light of
the various stated reasons for viewing Error Type a as more serious than
Error Type c (which may not even in itself be an error), it would seem
that an evaluation of the various considerations Binder raises would
lead to a preference for an r-s rather than an a-s strategy.

Perhaps some different perspectives can clarify further the basis
for this preference. Suppose the same set of data provides both types
of support or neither. In such cases it would make little difference
which analysis was applied. But suppose the data provide r-s but not
a-s. Such data can be meaningfully interpreted as lending the theory
some support as Grant (1962) has pointed out. The critical question is,
"What are the implications of accepting a-s when r-s cannot be
claimed?" It is suggested that a-s without r-s means nothing. Such a
result can seemingly only occur when the experiment is too imprecise
to justify any conclusion. Although an a-s analysis may sometimes
be a useful or interesting supplement to the r-s analysis, [...] (1960)
seems justified in maintaining that a-s alone is not enough. To state
this conclusion in another way: a-s-a is adequate in some situations,
but in all these cases r-s-a will be adequate also; however, in
other situations, which apparently cannot always be identified, r-s-a
will be adequate but a-s-a will lead one astray. Hence [...].

Thus far this paper has considered only [...] the experimenter is
comparing the fit between data points and [...] expectations that take
the form of exact [...] predictions, e.g., the correspondence between an
empirical and [...]. It is hopefully clear that two different [...] can
be taken to the analysis of such data, the acceptance-support analysis
or the rejection-support analysis.

Obviously many, if not most, experiments in psychology do not
involve exact predictions. Often it is only the directionality of
results that is predicted, e.g., the experimental group is expected to
be less than, greater than, or perhaps equal to [...]. [...] in the
context of exact quantitative prediction [...] experiments that test
[...] only whether groups are less than, greater than, [...]. It is
suggested that this second context also allows [...] to adopt an a-s or
an r-s strategy and that the arguments [...] are just as compelling in
this situation.

It is true that when prediction is not exact, the kind of analysis
applied will almost necessarily be analogous to what has been discussed
as r-s-a. Suppose one predicts that the experimental group will exceed
the control group. The H0 states no difference. Support for the
prediction comes from rejecting this H0. Application of an a-s-a, in
this context, would involve conceptualizing the H0 as follows: "The
groups do not differ significantly from the rank order predicted."
A-s is obtained unless the groups differ significantly in the direction
opposite that predicted. One, of course, is not likely to think of an
a-s-a in this particular context because, in this context, a-s is
ridiculously easy to obtain. Perhaps, however, this fact only lends
support to the assertion made earlier: a-s without r-s means nothing.

Even when predictions are only directional and an r-s-a is used, 
the experimenter can still adopt an acceptance versus a rejection support 
strategy by choosing to design his experiment so that his theory 
predicts no differences versus some differences. It will be convenient 
to refer to the no-difference experiment as an a-s-design (a-s-d) and 
to the some-difference experiment as an r-s-design (r-s-d). If the 
experimenter chooses an a-s-d, he can claim support for his theory 
if no differences are found and at the same time present his negative 
results as evidence against any rival theories that do predict a difference. 
Rock and others (Rock, 1957; Rock and Heimer, 1959; Rock and 
Steinfeld, 1963) have used just such a strategy. If the experimenter 
wishes to use an r-s strategy, he adopts an r-s-d; and if differences are
obtained, he claims support for his position and evidence against any
theories that predict no difference. Wilson (1962) has used just such a
strategy. The experimenter may, of course, change strategies in
midstream. For example, if an experimenter fails to get r-s, he may
adopt a new theoretical position that predicts no difference and cite
his data as giving a-s to it. When Thorndike reformulated his law of
effect on the basis of negative results, he essentially changed from an
r-s to an a-s strategy. Seemingly, an experimenter could, in theory,
start out without a theory and formulate one post hoc, in which case he
might claim either a-s or r-s.

The objections made to a-s-a seem to apply with equal or greater
force to a-s-d. In both cases imprecision increases the danger of
accepting an erroneous theory, whereas both r-s-a and r-s-d protect
against erroneous conclusions even in imprecise experiments.

A more important and compelling point would seem to be that
in [...] cases we are using statistics because empirical hypotheses
[...] acceptance (or rejection). Rather they are accepted
with more or less confidence depending on the adequacy of the
evidence. Indeed, the main point of inferential statistics would seem
to [...] degree of correspondence or [...] ([...] in psychology), we are
only able to attach such a probability statement to the rejection of the
null hypothesis. For example, in a directional empirical prediction we
can say that 1 or 5% of the time (as we choose) we will be wrong in
rejecting the null hypothesis on the basis of such data as these. We
cannot, however, say what the probability is of falsely accepting it
except for the unlikely event in which we can specify a necessary
magnitude. Therefore, only an r-s strategy will supply an exact estimate
and an a-s strategy allows one to "accept" his hypothesis only in the
negative sense of having found no evidence against it. This does not
seem to be very satisfying grounds especially when error variance is
large.
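The asymmetry described here can be shown with a small calculation. The
sketch below assumes a one-tailed critical ratio (z) test with a known
standard error, which is a simplification introduced for illustration;
the function names are hypothetical. The probability of falsely
rejecting the null hypothesis is set by the experimenter, whereas the
probability of falsely accepting it cannot be computed until a true
difference is specified.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def false_rejection_rate(alpha=0.05):
    """Probability of rejecting a true null hypothesis: fixed in advance."""
    return alpha


def false_acceptance_rate(true_difference_in_se, alpha=0.05):
    """Probability of failing to reject when a real difference exists.

    true_difference_in_se: the true difference divided by its standard
    error. This must be supplied by the experimenter; without it the
    rate is undefined, which is the point made in the text.
    """
    critical = STANDARD_NORMAL.inv_cdf(1 - alpha)
    return STANDARD_NORMAL.cdf(critical - true_difference_in_se)


print("false rejection:", false_rejection_rate())
for difference in (0.5, 1.0, 2.0, 3.0):
    rate = false_acceptance_rate(difference)
    print("true difference", difference, "SE -> false acceptance", round(rate, 3))
```

The false acceptance rate ranges from nearly 1 - alpha down toward zero
depending entirely on the assumed true difference, so no single figure
can be attached to "accepting" the null hypothesis.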
A final consideration has to do with the nontheoretical value of
the information obtained from an a-s-d vs. an r-s-d. It is perhaps not
undue to ask of an experimenter what value his information has other
than its implications for choosing between theories. It may be noted
that a-s-d seem to commit one to the cataloging of ineffective variables
and procedures. Very likely a cataloging of effective variables and
procedures, which is the logical result of r-s-d, would be more useful.


Summary

A choice can be made between an acceptance-support and a
rejection-support strategy when prediction is merely directional as
well as when it is exact. When exact quantitative predictions are
derived, one chooses an acceptance-support vs. a rejection-support
strategy by choosing an acceptance-support analysis vs. a
rejection-support analysis. When only the directionality of the outcome
is predicted, a rejection-support analysis is almost sure to be used,
but one can still make the choice of an acceptance-support vs. a
rejection-support strategy by choosing an acceptance-support design vs.
a rejection-support design. In both cases rejection-support seems to be
the better strategy, especially in that it enables one to minimize and
quantify the danger of accepting an erroneous theory even in imprecise
experiments.
tactical note on the
relation between
scientific and statistical
hypotheses¹


Ward Edwards 


Grant (1962), Binder (1963), and Wilson and Miller (1964) have been
debating the question of what should be the appropriate relationship
between the scientific hypotheses or theories that a scientist is
interested in and the statistical hypotheses, null and alternative, that
classical statistics invites him to use in significance tests. Grant
rightly notes that using the value predicted by a theory as a null
hypothesis puts a premium on sloppy experimentation, since small numbers
of observations and large variances favor acceptance of the null
hypothesis and "confirmation" of the theory, while sufficiently precise
experimentation is likely to reject any null hypothesis and so the
theory associated with it, even when that theory is very nearly true.
Grant's major recommendation for coping with the problem is to use
confidence intervals around observed values; if the theoretical values
do not lie within these limits, the theory is suspect. With this
technique also, sloppy experimentation will favor acceptance of the
theory, [...] least the width of the intervals will display sloppiness.
[...] suggests testing the hypothesis that the correlation between
predicted and observed values is zero (in cases in which a function
rather than a point is being predicted), but notes that an experiment of
reasonable precision will nearly always reject this hypothesis for
theories of even very modest resemblance to the truth. Binder, defending
a more classical view, argues that the inference from outcome of a
statistical procedure to a scientific conclusion must be a matter of
judgment, and should certainly take the precision of the experiment into
account, but that there is no reason why the null hypothesis should not,
given an experiment of reasonable precision, be identified with the
scientific hypothesis of interest. Wilson and Miller point out that the
argument concerns not only statistical procedures but also choice of
theoretical prediction to be tested, since some predictions are of
differences and some of no difference. Their point seems to apply
primarily to loosely formulated theories, since precise theories will
make specific numerical predictions of the sizes of differences and it
would be natural to treat these as null hypothesis values.

From Psychological Bulletin, Vol. 63 (No. 6), June, 1965, pp. 400-402.
Copyright 1965 by the American Psychological Association.
1. This research was supported by the United States Air Force under
Contract AF 19 (628)-2823 monitored by the Electronics Systems
Division, Air Force Systems Command. I am grateful to L. J. Savage,
D. A. Grant, and W. R. Wilson for helpful criticisms of an earlier
draft.


Edwards, Lindman, and Savage (1963), in an expository paper on
Bayesian statistical inference, have pointed out that from a Bayesian
point of view, classical procedures for statistical inference are always
violently biased against the null hypothesis, so much so that evidence
that is actually in favor of the null hypothesis may lead to its
rejection by a properly applied classical test. This fact implies that,
other things being equal, a theory is likely to look better in the
light of experimental data if its prediction is associated with the
alternative hypothesis than if it is associated with the null
hypothesis.

For a detailed mathematical exposition of the bias of classical
significance tests, see Edwards, Lindman, and Savage (1963) and
Lindley (1957). Lindley has proven a theorem frequently illustrated in
Edwards, Lindman, and Savage (1963) that amounts to the following.
An appropriate measure of the impact of evidence on one hypothesis as
against another is a statistical quantity called the likelihood ratio.
Name any likelihood ratio in favor of the null hypothesis, no matter
how large, and any significance level, no matter how small. Data can
always be invented that will simultaneously favor the null hypothesis
by at least that likelihood ratio and lead to rejection of that
hypothesis at at least that significance level. In other words, data can
always be invented that highly favor the null hypothesis, but lead to
its rejection by an appropriate classical test at any specified
significance level. That theorem establishes the generality and ubiquity
of the bias. Edwards, Lindman, and Savage (1963) show that data like
those found in psychological experiments leading to .05 or .01 level
rejections of null hypotheses are seldom if ever strong evidence against
null hypotheses, and often actually favor them.
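A rough numerical sketch may convey how the theorem operates. It is not
Lindley's proof. It assumes a normally distributed sample mean with
known standard deviation and a uniform (diffuse) alternative of fixed
width, and it approximates the density of the mean under the alternative
by one over that width, which is reasonable only when the standard error
is small relative to the width and the datum lies well inside the
region. All names and numerical values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def likelihood_ratio_for_null(z, n, sigma=1.0, alternative_width=40.0):
    """Likelihood ratio (null : diffuse alternative) for a sample mean.

    z: distance of the sample mean from the null value, in standard errors.
    n: number of observations; the standard error is sigma / sqrt(n).
    """
    standard_error = sigma / sqrt(n)
    density_if_null = NormalDist(0.0, standard_error).pdf(z * standard_error)
    # Assumption: a uniform alternative has density 1/width near the datum.
    density_if_alternative = 1.0 / alternative_width
    return density_if_null / density_if_alternative


# Hold the result at z = 1.96 (just significant at .05, two-tailed)
# and let the experiment grow more thorough.
for n in (1, 100, 10000):
    print(n, round(likelihood_ratio_for_null(1.96, n), 1))
```

With the significance level held fixed, the likelihood ratio in favor of
the null hypothesis grows with the square root of n under these
assumptions, so a sufficiently large experiment can reject the null
hypothesis with data that favor it by any stated ratio.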
__ The following example gives the flavor of the argument, though it 
is extremely crude and makes no use of such tools as likelihood ratios. 
The boiling point of statistic acid is known to be exactly 50°C. You, 
an organic chemist, have attempted to synthesize statistic acid; in 
front of you is a beaker full of foul-smelling glop, and you would like 
to know whether or not it is indeed statistic acid. If it is not, it may be 
any of a large number of related compounds with boiling points 
diffusely (for the example, that means uniformly) distributed over the 
region from 130°C to 170°C. By one of those happy accidents so 
common in statistical examples, your thermometer is known to be 
unbiased and to produce normally distributed errors with a standard 
deviation of 1°. So you measure the boiling point of the glop, once. 
The example, of course, justifies the use of the classical critical 
ratio test with a standard deviation of 1°. Suppose that the glop really 
is statistic acid. What is the probability that the reading will be 151.96° 
or higher? Since 1.96 is the .05 level on a two-tailed critical ratio test, 
but we are here considering only the upper tail, that probability is .025. 
Similarly, the probability that the reading will be 152.58° or greater 
is .005. So the probability that the reading will fall between 151.96° 
and 152.58°, if the glop is really statistic acid, is .025 - .005 = .02. 
What is the probability that the reading will fall in that interval 
if the glop is not statistic acid? The size of the interval is .62°. If the 
glop is not statistic acid, the boiling points of the other compounds 
that it might be instead are uniformly distributed over a 40° region. 
So the probability of any interval within that region is simply the width 
of the interval divided by the width of the region, .62/40 = .0155. 
So if the compound is statistic acid, the probability of a reading between 
151.96° and 152.58° is .02, while if it is not statistic acid that probability 
is only .0155. Clearly the occurrence of a reading in that region, 
especially a reading near its lower end, would favor the null hypothesis, 
since a reading in that region is more likely if the null hypothesis is 
true than if it is false. And yet, any such reading would lead to a 
rejection of the null hypothesis at the .05 level by the critical ratio test. 
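
The arithmetic of the example can be checked with a short sketch. It is only an illustration: the function name and variable names are assumptions introduced here, and the alternative is treated exactly as in the example (a flat 40° region, with thermometer error ignored under the alternative).

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Values taken from the example: boiling point 150 under the null,
# thermometer sd of 1 degree, alternatives uniform on 130..170.
NULL_MEAN, SD = 150.0, 1.0
ALT_LOW, ALT_HIGH = 130.0, 170.0
low, high = 151.96, 152.58

# Probability of a reading in the interval if the glop is statistic acid.
p_null = normal_cdf(high, NULL_MEAN, SD) - normal_cdf(low, NULL_MEAN, SD)

# Probability of the same interval under the uniform alternative
# (assumption, as in the text: measurement error is ignored here).
p_alt = (high - low) / (ALT_HIGH - ALT_LOW)

print(f"P(interval | statistic acid)     = {p_null:.4f}")
print(f"P(interval | not statistic acid) = {p_alt:.4f}")
```

The two printed values should come out close to the .02 and .0155 quoted in the example, with the first larger than the second.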

Obviously the assumption made about the alternative hypothesis 
was crucial to the calculation. (Such special features as normality, 
the uniformity of the distribution under the alternative hypothesis, 
and the particular regions and significance levels chosen are 
not at all important; they affect only the numerical details, not the 
basic phenomenon.) The narrower the distribution under the alternative 
hypothesis, the less striking is the paradox; the wider that 
distribution, the more striking. That distribution is narrowest if it is a 
single point, and favors the alternative hypothesis most if that point 
happens to coincide with the datum. And yet Edwards, Lindman, and 
Savage (1963) show that even a single-point alternative hypothesis 
located exactly where the data fall cannot bias the likelihood ratio 
against the null hypothesis as severely as classical significance tests 
are biased. 
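
The single-point case can be illustrated numerically. The sketch below is not taken from the text; it relies on the standard result that, for one normally distributed observation with known standard deviation, the likelihood ratio of a point alternative located exactly at the datum to the null is exp(z²/2), where z is the critical ratio. The function name is an assumption, and treating a significance level p as odds of roughly 1/p is only an informal yardstick used for comparison.

```python
import math

def max_likelihood_ratio_against_null(z):
    """Likelihood ratio (point alternative at the datum : null) for a
    single normal observation lying z standard deviations from the null."""
    return math.exp(z * z / 2.0)

# Critical ratios for the two-tailed .05 and .01 levels.
for z, level in ((1.96, 0.05), (2.58, 0.01)):
    ratio = max_likelihood_ratio_against_null(z)
    # Informal comparison only: 1/level is NOT a likelihood ratio.
    print(f"z = {z}: likelihood ratio at most {ratio:.1f}, "
          f"against an informal 1/p of {1.0 / level:.0f}")
```

Under these assumptions even the most favorable alternative yields a ratio of roughly 7 to 1 at the .05 level, well short of what the significance level might be taken to suggest.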
This violent bias of classical procedures is not an unmitigated 
disaster. Many null hypotheses tested by classical procedures are 
scientifically preposterous, not worthy of a moment’s credence even 
as approximations. If a hypothesis is preposterous to start with, 
no amount of bias against it can be too great. On the other hand, if 
it is preposterous to start with, why test it? 
The implication of this bias of classical procedures against null 
hypotheses seems clear. If classical procedures are to be used, a 
theory identified with a null hypothesis will have several strikes against 
it just because of that identification, whether or not the theory is true. 
And the more thorough the experiment, the larger that bias becomes. 
The scientific conservative, eager to make sure that error is scotched 
at any cost, will therefore prefer to test his theories as null hypotheses, 
to their detriment. The scientific enthusiast, eager to make sure that 
his good new ideas do not die premature or unnecessary deaths, 
will if possible test his theories as alternative hypotheses, to their 
advantage. Often, these men of different temperament will reach 
different conclusions. 


The subjectivity of this 
There should be a better, 


ment with less arbitrary measures of congruence. A man from Mars, 
asked whether or not your suit fits you, would have trouble answering. 
He could notice the discrepancies between its measurements and yours, 
and might answer no; he could notice that you did not trip over it, 
and might answer yes. But give him two suits and ask him which fits 
you better, and his task starts to make sense, though it still has its 
difficulties. I believe that the argument between Grant and Binder is 
essentially unresolvable: no procedure can test the goodness of fit of a 
single model to data in any satisfactory way. But procedures for 
comparing the goodness of fit of two or more models to the same data 
are easy to come by, entirely appropriate, and free of the difficulties 
Binder and Grant have been arguing about. (They do have difficulties 
of their own. Most [...] must specify to some extent the [...] 
data-generating process, or else [...] 
model of the data-generating process, such as the normality [...] 
thermometer in the statistic acid example. 
But of course this difficulty is common to all of statistics, 
and is as much a difficulty for the approaches I am rejecting as for 
those I am espousing.) The likelihood-ratio procedures I advocate do 
not make any use of classical null-hypothesis testing, and so the question 
of which model to associate with the null hypothesis does not arise. 
While there is nothing essentially Bayesian about such procedures, I 
prefer their Bayesian to their non-Bayesian versions, and so 
refer you to Savage (1962), Raiffa and Schlaifer (1961), Schlaifer (1959, 
1961), and Edwards, Lindman, and Savage (1963) as appropriate 
introductions to them. Unfortunately, I cannot refer you to literature 
telling you how to invent not just one but several plausible models 
that might account for your data. 
much ado about the 
null hypothesis 
Warner R. Wilson 


Howard L. Miller 


Jerold S. Lower 


From Psychological Bulletin, Vol. 67 (No. 3), 1967, pp. 188-196. Copyright 1967 
by the American Psychological Association. Reproduced by permission. 
1. The authors wish to express their thanks to Ward Edwards for his personal 
communications with them and for the patient forbearance he has demonstrated 
in attempting to help clarify the ideas that are discussed in this paper. 
2. The order of authorship should not be interpreted as implying a greater 
contribution on the part of the senior author. 

Grant (1962) and Binder (1963) have clarified the fact that two strategies 
can be used in theory testing. First, one can identify the theory under 
test with the null hypothesis and claim support for the theory if the 
null hypothesis is accepted. The second, presumably more orthodox 
and traditional approach is to identify the theory under test with the 
alternative hypothesis and claim support for the theory if the null 
hypothesis is rejected. Binder has referred to these two approaches 
as acceptance support and rejection support strategies. In this paper, 
however, the authors will follow Edwards (1965) and speak of “identifying 
one’s theory with the null hypothesis” and basing support on 
the acceptance of the null vs. “identifying with the alternative 
hypothesis” and basing support on the rejection of the null and, of 
course, the subsequent acceptance of the alternative. 

Binder (1963) has ably pointed out that either strategy may be used 
effectively under some circumstances. Wilson and Miller (1964a, 
1964b) have joined with Grant, however, in arguing that it is generally 
better to identify with the alternative and, hence, base support for a 
theory on the rejection of some null hypothesis. These writers pointed 
out that while the probability of rejecting the null hypothesis wrongly 
is held constant, for example, at the .05 level, the probability of 
accepting the null hypothesis wrongly varies with the precision of the 
experiment. To the extent that error is large, the study in question is 
biased for the null hypothesis and for any theories identified with it. 
According to this view, the conservative, cautious approach is to 
identify one’s theory with the alternative hypothesis. 

This essentially orthodox (Fisherian) view of classical statistical 
procedures has seemed so reasonable to the present authors that they 
were surprised to find Edwards, who identifies himself with Bayesian 
statistics, taking a dramatically opposed view. Edwards’ article is 
only one of several considering the relative virtues of classical versus 
Bayesian statistics (see Binder, 1964, for an excellent review). Edwards’ 
article seems especially important, however, due to the fact that it 
strongly urges changes in the tactics of the orthodox classical statis- 
tician—changes which might prove to be ill-advised in some cases 
and impossible in others. Edwards’ (1965) paper seems to make or 
imply the following points: (a) Classical procedures, in fact, are 
always violently biased against the null hypothesis [p. 400]. (b) The 
cautious, conservative approach, therefore, is to identify one’s theory 
with the null hypothesis and, hence, base support for one’s theory on 
the acceptance of the null hypothesis. (c) The ideal solution, however, 
is to compare the goodness of fit of several models to the same data, thus 
avoiding the whole problem of null hypothesis testing. 

Edwards apparently believes that a Bayesian analysis is always 
possible and that, if it is not, the experiment in question is not worth 
doing. The present writers do not agree with this point. They do, 
however, find much that is admirable in Edwards’ position and certainly 
agree that Bayesian procedures are to be preferred when they can 
be used. The purpose of the present paper, therefore, is not to disagree 
with Edwards so much as to suggest clarification and qualification. 

In connection with Point a, it is conceded that Edwards is commenting 
on differences between classical and Bayesian statistics that 
really exist. It is suggested, however, that the term “bias” is perhaps 
not the best way to sum up these differences. A Bayesian analysis 
typically assumes that the datum comes from a null distribution or 
from some other distribution. Classical statistics assumes that the 
datum comes from a null distribution or from some one of all 
possible distributions. The assumption that the datum may come from 
any distribution does, indeed, always increase the probability 
that it comes from some distribution other than the null. The difference 
lies in the nature of the alternatives which are to be considered. 
Classical procedures happen to assume that all possible alternatives 
must be taken into account. Granted this assumption, any bias 
against the null is then [...] the significance level, which, 
in the case of the .05 level, clearly favors the null. 

In relation to Point b, it is suggested that the bias for theories 
identified with the null hypothesis in imprecise experiments, which 
Grant talked about, exists independently of and is logically distinct 
from the bias against the null hypothesis which Edwards talked about. 
Even when Edwards’ bias against the null hypothesis exists, it does not 
imply the absence of Grant’s bias for the null hypothesis. These 
considerations make the choice of tactics more complex than Edwards’ 
article indicated. Edwards’ (1965) tactical advice was that, “If classical 
procedures are to be used, a theory identified with a null hypothesis will 
have several strikes against it. . . . And the more thorough the experiment, 
the larger that bias becomes [p. 402].” An attempt will be made 
to show that this advice is valid only in experiments of extreme 
precision, a type presumably rare in psychology. As experiments 
become imprecise, just the opposite tactical advice becomes appropriate. 

In relation to Point c, it is suggested that no matter how many 
models we may have, many people will still find null hypotheses which 
will seem to them to need testing. 


Is classical statistics 
always biased against 
the null hypothesis? 


Edwards’ first point was that classical statistics is always violently 
biased against the acceptance of the null hypothesis. He argued 
persuasively for this point in several different ways. He presented, for 
one thing, the following example, which supposedly illustrates this 
bias. This example and other points made by Edwards will be considered 
here in hopes that the discussion will help clarify the circumstances 
under which classical procedures can and cannot meaningfully 
be said to be biased, and also give some indication of how often 
inappropriate rejections of the null hypothesis may, in fact, occur. 

The following example gives the flavor of the argument, though it is 
extremely crude and makes no use of such tools as likelihood ratios. 
The boiling point of statistic acid is known to be exactly 50°C. [Presumably 
this number was intended to be 150°C.] You, an organic chemist, 
have attempted to synthesize statistic acid; in front of you is a beaker 
full of foul-smelling glop, and you would like to know whether or not it 
is indeed statistic acid. If it is not, it may be any of a large number of 
related compounds with boiling points diffusely (for the example, that 
means uniformly) distributed over the region from 130°C. to 170°C. 
By one of those happy accidents so common in statistical examples, 
your thermometer is known to be unbiased and to produce normally 
distributed errors with a standard deviation of 1°. So you measure the 
boiling point of the glop, once. 

The example, of course, justifies the use of the classical critical 
ratio test with a standard deviation of 1°. Suppose that the glop really 
is statistic acid. What is the probability that the reading will be 151.96° 
or higher? Since 1.96 is the .05 level on a two-tailed critical ratio test, 
but we are here considering only the upper tail, that probability is .025. 
Similarly, the probability that the reading will be 152.58° or greater is 
.005. So the probability that the reading will fall between 151.96° and 
152.58°, if the glop is really statistic acid, is .025 - .005 = .02. 

What is the probability that the reading will fall in that interval if the 
glop is not statistic acid? The size of the interval is .62°. If the glop is 
not statistic acid, the boiling points of the other compounds that it might 
be instead are uniformly distributed over a 40° region. So the probability 
of any interval within that region is simply the width of the interval 
divided by the width of the region, .62/40 = .0155. So if the compound 
is statistic acid, the probability of a reading between 151.96° and 152.58° 
is .02, while if it is not statistic acid that probability is only .0155. 
Clearly the occurrence of a reading in that region, especially a reading 
near its lower end, would favor the null hypothesis, since a reading in 
that region is more likely if the null hypothesis is true than if it is false. 
And yet, any such reading would lead to a rejection of the null hypothesis 
at the .05 level by the critical ratio test [Edwards, 1965, p. 401]. 

If we follow the mode of analysis Edwards used, that of comparing 
an area of the null distribution to an area of the alternative distribution, 
we can note first that the probability of actually rejecting the null 
when the data in fact favor it is quite low, even in this contrived 
example. It would seem that we would reject the null when the data 
actually favor it when the observation falls in the interval 151.96- 
152.40. When the null is true, the data will fall in this interval, or in the 
corresponding lower interval, only about 3% of the time. 
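
The "about 3%" figure can be reproduced with a brief sketch. The names below are assumptions introduced for illustration; the interval end points are the ones quoted above, expressed in standard deviation units. The last step, which locates the point where the normal density under the null equals the flat density 1/40 of the alternative, is an added check and not a calculation stated in the text.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

Z_REJECT = 1.96   # 151.96 degrees, the .05 two-tailed cutoff
Z_UPPER = 2.40    # 152.40 degrees, upper end of the quoted interval

# Chance, when the null is true, of landing in 151.96-152.40
# or in the corresponding interval below 150.
one_tail = normal_cdf(Z_UPPER) - normal_cdf(Z_REJECT)
both_tails = 2.0 * one_tail
print(f"P(reject although data favor null | null true) = {both_tails:.4f}")

# Added check (assumption: compare densities at a point rather than areas):
# the z at which the null density equals the uniform density 1/40.
uniform_density = 1.0 / 40.0
z_cross = math.sqrt(-2.0 * math.log(uniform_density * math.sqrt(2.0 * math.pi)))
print(f"densities are equal near z = {z_cross:.2f}")
```

On a density basis the crossover falls slightly below 2.40, which leaves the conclusion of roughly 3% unchanged.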
In addition, the type of analysis Edwards used has two aspects 
that seem intuitively unappealing: (a) All outcomes have a low 
probability under the alternative hypotheses, and (b) this probability 
is equally low no matter where an observation occurs. For example, if a 
datum occurs at 150° the implied inference is that the probability of a 
hit in this segment of .62 is .62/40 or .0155; likewise, if a datum occurs 
at 169° the implied inference is that the probability of a hit in this 
segment of .62 is still .62/40 or .0155. 

Another way of looking at this example avoids both of the aspects 
noted above. Suppose, for simplification, that 1.96 is rounded off to 
2.00. The null hypothesis then implies that the observations should fall 
in a segment of 4° between 148° and 152° 95% of the time. Suppose the 
actual reading is 152°. The probability of a hit as far as 2° away from 
150°, if the null is true, is only .05; therefore, the null is rejected. It 
would seem that in order to test the alternative hypothesis, one could 
ask, “What is the probability of a hit as close as 2° to 150° if the acid is 
not statistic?” The answer would seem to be 4/40 or .10. It would seem, 
then, that a reading of 151.96° or 152° is more probable if the null 
hypothesis is false, and rejection of the null would seem appropriate 
after all. Perhaps, then, the bias is not as prevalent as Edwards would 
lead us to believe. Indeed, if a person working at the .05 level of 
confidence is to run any danger of rejecting the null when the data 
actually support it, it would seem that the width of the distribution 
under the alternative hypothesis must be more than 80 standard 
deviations. The present authors will leave it up to Edwards to show that 
situations of this sort occur frequently enough to justify concern. 
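
The comparison in this paragraph, and the figure of 80 standard deviations, follow from simple arithmetic that can be sketched as follows. The function and variable names are assumptions made for the illustration, and the critical ratio is rounded to 2.00 as in the text.

```python
def prob_within(half_width, region_width):
    """Probability that a uniformly distributed value falls within
    half_width of the center of a region of the given total width."""
    return 2.0 * half_width / region_width

ALPHA = 0.05      # probability of a hit at least 2 sd away if the null is true
Z = 2.0           # rounded critical ratio, in sd units

# Probability of a hit as close as 2 degrees to 150 under the 40-degree alternative.
p_alt = prob_within(Z, 40.0)
print(f"P(hit within 2 sd | alternative, width 40) = {p_alt:.2f}")

# Width at which the alternative probability drops to ALPHA; only a wider
# alternative makes the observed reading less probable under the alternative.
threshold_width = 2.0 * Z / ALPHA
print(f"alternative must be wider than {threshold_width:.0f} sd")
```

With a region of 40 standard deviations the alternative probability is .10, twice the .05 obtained under the null, and the two become equal only when the region is 80 standard deviations wide.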

The reader may wish to note, however, that the apparent bias in 
classical procedures can be manipulated at will. If either a broad 
alternative distribution or a small error term is assumed, the bias will be 
increased. 

Lindley (1957), another Bayesian, presumably had just such 
considerations in mind when he stated that data can always be invented 
that highly favor the null hypothesis yet lead to its rejection by a 
properly applied classical test. Although such data can be invented, 
it is still meaningful to ask whether such data are inevitable or even 
likely in reality. By assuming a wide enough alternative, a bias can 
be created; however, in reality the alternative distribution is supposedly 
based on some theoretical or empirical consideration, and its width 
cannot be set arbitrarily. Error terms, on the other hand, can be reduced, 
even in reality, by the expedient of collecting more cases. However, 
in order for a reduced error term to lead inevitably to a bias, it is 
necessary to assume that as the error term is reduced, the absolute 
deviation from chance becomes less, so that the probability of the 
deviation’s occurrence [...] 
of probability in the [...] 
in reality. Consider again the statistic acid example, with a 
deviation of 2° and an error term of 1°. The 
probability of a hit at 152° is .05 under the null and .10 under the 
alternative. In real life, any observation is more likely than not to be 
close to the true value, so if data are collected until the error term becomes 
.1°, the mean of all the observations may still be near 152°. 
If so, the probability under the alternative would still be .10, but the 
probability under the null would have become far, far less, since the 
critical ratio would now be 20. Thus, although it is necessary to 
admit that data can always be invented which will make a classical 
test look biased, it is also possible to point out that such data are not 
necessarily obtainable in reality and that certainly it is also easy to 
invent data that make classical tests look unbiased. 
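
A small sketch shows how the critical ratio grows when the observed mean stays near 152° while the error term shrinks. The helper names are assumptions; the deviation of 2° and the error terms of 1° and .1° come from the example, and the intermediate value of .5° is added only for illustration.

```python
import math

def critical_ratio(observed, null_value, error_term):
    """Deviation from the null value in units of the error term."""
    return (observed - null_value) / error_term

def two_tailed_p(z):
    """Two-tailed normal probability of a deviation at least |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))

for error_term in (1.0, 0.5, 0.1):
    z = critical_ratio(152.0, 150.0, error_term)
    print(f"error term {error_term}: critical ratio {z:.0f}, "
          f"two-tailed p = {two_tailed_p(z):.3g}")
```

With an error term of 1° the critical ratio is 2 and the two-tailed probability is close to .05; with an error term of .1° the critical ratio is 20 and the probability under the null is vanishingly small, while the probability under the 40° alternative is unaffected.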


The considerations up to this point seem to support the assertion 
that classical statistics if blindly applied will sometimes be biased 
against the null hypothesis. Support for the assertion that classical 
statistics is always biased seems lacking. Instead, it appears that the 
extent of such bias depends in great part on the width of the alternative 
distribution relative to the width of the null distribution. 

In further exploring Edwards’ position, it may be noted that 
Edwards conceded that the bias becomes less as the relative width 
of the alternative distribution decreases. He firmly insisted, however, 
that no matter how narrow the alternative distribution, the bias will 
persist. Edwards seems to have come to this conclusion through an 
inappropriate comparison of probability levels with likelihood ratios. 


The likelihood ratio is the ratio of one probability density (at the 
point of the data) to another. In a case such as the acid example, the 
Bayesian would apparently rather base his conclusions on the likelihood 
ratio than on the probability level, and the present authors would have 
no objection so far as such cases are concerned. In commenting on the 
relation of likelihood ratios to classical significance tests, Edwards 
(1965) said: 

Even a single-point alternative hypothesis located exactly where the 
data fall [the form of alternative distribution that most violently biases 
the likelihood ratio against the null hypothesis] cannot bias the likelihood 
ratio against the null hypothesis as severely as classical significance 
tests are biased [p. 401]. 

Whether we formally use likelihood ratios (ratios of ordinates) or 
simply other ratios of conditional probabilities (areas), what Edwards 
was saying seems to have a certain surface validity. If we consider only 
the probability (or probability density at a point) under one distribution, 
the null hypothesis, and do not take into consideration how likely this 
event might be under some specified alternative distribution, we are 
biasing ourselves against our null hypothesis by comparing it essentially 
to all possible alternatives instead of comparing it to just one. It 
seems to us that this is exactly what the classical statistician intends 
to do. If we have a clearly defined alternative, and we can say that 
“reality” must be one or the other, then we can justify a likelihood ratio 
or some similar procedure. If we do not have such alternative models, 
we cannot invent them to avoid a theoretical bias. 

Edwards, Lindman, and Savage (1963) implied that in any inference 
based on statistics, the decision involved must be a joint function of a 
prior probability estimate (what you thought before about the likeli- 
hood of your hypothesis), the likelihood ratio, and the payoff matrix 
(the relative rewards and costs of being right and wrong about your 
hypothesis). Edwards seemed to imply that “classical statisticians” 
use no such considerations. It is doubtful that this is so. For one thing, 
a choice of significance level can be 
construed as a prior probability estimate, not the subjective one of 
the individual scientist but an admittedly arbitrary [...] 
scientist’s willingness to make an inference. An undisputed goal 
of science is objectivity, that is, public reproducibility. To introduce, into 
[...] 
implicit. The natural tendency is for investigators to believe that 
their hypotheses are correct and that the world can ill afford to ignore 
them. If such subjective inclinations were allowed full sway, experimentation 
could become superfluous, and science might well degenerate 
[...] Classical statistics wisely resolves prior probability 
[...] in a conservative and standard rather than 
in a subjective and variable manner. 

In answer to the obvious criticism concerning the subjectivity of 
invented alternative hypotheses, Edwards (personal communication, 
October 1966) stated essentially that a 
scientist always has some information regarding alternatives. Bayesian 
statistics considers this information; classical statistics does not. 
[...] the bias that Edwards was discussing is a function of the 
[...] (assume?) this additional information. In point of fact, 
the choice of one particular alternative distribution with which to 
compare the null excludes other possible alternatives. It is precisely 
this exclusion (which may be justifiable under some circumstances) 
which increases the probability that you will accept the null. 
In this section we have attempted to discuss such questions as 
“Are classical procedures biased against the null hypothesis?” and 
“How often might such a bias result in rejections of the null when the 
data actually favor it or discredit it only weakly?” We see no reason to 
view classical procedures as biased against the null in any absolute 
sense—the opposite seems to be the case. We see no reason to view 
classical procedures as biased against the null hypothesis relative to 
Bayesian statistics since this comparison is at best unsatisfactory 
due to the difference in the procedures which are used and in the 
information which is assumed to be available. We can concede only 
that under special conditions, including presumably a specifiable 
alternative distribution, blind use of a classical analysis might result in 
a rejection of the null when a defensible Bayesian analysis, considering 
only the specifiable alternative, might show that the data actually 
support the null. We know of not one real-life instance in which the 
above has been demonstrated. It would seem to us that circumstances 
making such errors likely are not frequent, and it is suggested that the 
burden of proof is on those who think that such errors occur frequently 
in the literature, as opposed to in the examples used in articles 
written by Bayesians. 


Should the scientific 
conservative always 
identify his theory with 
the null hypothesis? 

Edwards’ second main point was that if one does use classical statistics, 
the more conservative strategy is to identify one’s theory not with the 
alternative hypothesis, as Wilson, Miller, and Grant advocated, but 
with the null hypothesis. This point, too, is apparently incorrect. 
What is more, even if the first point, about classical statistics being 
always biased, were correct, the second point would not necessarily 
follow. In order to see the actual independence of the two points, it 
is helpful to realize that if something is biased, it is biased relative to 
something else. Edwards presumably meant that classical statistics is 
always biased relative to Bayesian statistics. 

On the other hand, when Grant, Wilson, and Miller said that an 
insensitive experiment is biased for the acceptance of the null hypothesis 
and any theory identified with it, they meant that an insensitive 
experiment is biased relative to a sensitive one. A classical analysis can be 
biased against the null in comparison to a Bayesian analysis, and an 
insensitive experiment can still be biased for the null in comparison to a 
sensitive experiment. The two biases exist independently. Furthermore, 
the bias for the null hypothesis in insensitive experiments is 
just as true with a Bayesian analysis as with a classical analysis. 

The following example (see Table 1) is offered as an illustration of 
the bias in insensitive vs. sensitive experiments—a bias which, prior to 
the Edwards article, the present writers would have considered not in 
need of illustration. The example views experiments as perfectly 
sensitive or as completely insensitive; null hypotheses as true or 
false; and theories as identified with the alternative and supported if 
the null is rejected, or as identified with the null and supported if the 
null is not rejected. 

Edwards said that if you are a conservative, that is, if you wish to 
minimize undeserved successes, you should always identify your 
theory with the null. Wilson, Miller, and Grant said that in the case 
of insensitive experiments, such as are common in psychology, just 
the opposite tactic is to be recommended to the conservative. Table 1 
indicates the number of deserved and undeserved successes achieved by 
theorists depending on whether they do identify with the alternative 
or with the null. Overall, identification with the null clearly promotes 
success for one’s theory, a total of 285 (out of 400) successes vs. 115 
for those who identify with the alternative. The same is true in the case 
of undeserved successes. Those who identify with the null get a 
total of 95 undeserved successes (out of 200) vs. 10 for those who 
identify with the alternative. The most dramatic difference occurs, of 
course, in the case of insensitive experiments. The score is 95 illegitimate 
successes for those who test their theories as null hypotheses and 
only 5 for those who test their theories as alternative hypotheses. 
Hopefully, no one will wish to attempt to reconcile these outcomes 
with Edwards’ (1965) assertion: “The scientific conservative, eager to 
make sure that error is scotched at any cost, will therefore [always] 
prefer to test his theories as null hypotheses, to their detriment 
[p. 402].” 

Table 1 

Limits of deserved and undeserved successes assuming 
identification with the null versus the alternative in perfectly 
sensitive and insensitive experiments 

                           Identification of theory    Identification of theory 
                           with the null               with the alternative 

                           Percentage   Percentage     Percentage   Percentage 
                           deserved     undeserved     deserved     undeserved 
                           successes    successes      successes    successes 
Sensitive experiments 
  True null                    95            0              0            5 
  False null                    0            0            100            0 
Insensitive experiments 
  True null                    95            0              0            5 
  False null                    0           95              5            0 

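
The totals quoted from Table 1 can be re-derived by summing its columns. The sketch below simply encodes the table; the dictionary layout and the function name are assumptions made for the illustration.

```python
# Each entry holds (percentage deserved, percentage undeserved) successes,
# copied from Table 1, for the two ways of identifying one's theory.
TABLE_1 = {
    ("sensitive", "true null"):    {"null": (95, 0), "alternative": (0, 5)},
    ("sensitive", "false null"):   {"null": (0, 0),  "alternative": (100, 0)},
    ("insensitive", "true null"):  {"null": (95, 0), "alternative": (0, 5)},
    ("insensitive", "false null"): {"null": (0, 95), "alternative": (5, 0)},
}

def totals(strategy):
    """Return (deserved, undeserved) successes summed over all four rows."""
    deserved = sum(row[strategy][0] for row in TABLE_1.values())
    undeserved = sum(row[strategy][1] for row in TABLE_1.values())
    return deserved, undeserved

for strategy in ("null", "alternative"):
    deserved, undeserved = totals(strategy)
    print(f"identify with {strategy}: {deserved + undeserved} successes, "
          f"{undeserved} of them undeserved")
```

The sums agree with the figures in the text: 285 successes, 95 of them undeserved, for identification with the null, against 115 successes, 10 of them undeserved, for identification with the alternative.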
The authors would like to concede, however, that in recommending 
continued use of classical statistics combined with identification of 
theories with alternative hypotheses, they are operating on the basis 
of several beliefs which are not mathematically demonstrable. 
Perhaps the most critical belief in this context is that most experi- 
ments in psychology are insensitive. It may be noted that a conservative 
would not favor identification with the alternative on the basis of 
Table 1 unless he held this belief. In a perfectly sensitive experiment, 
only identification with the alternative leads to accepting one’s theory 
when it is false. This would occur because no matter how small the 
error, t’s of 1.96 and greater will occur 5% of the time. These writers 
would point out, however, that in very sensitive experiments, specious 
deviations from chance, though technically significant, would be so 
small that they could hardly mislead anyone. It might also be argued 
that belief in ESP, for example, survives partly on just such deviations. 
Such a consideration might justify more concern about such errors, 
greater use of the .01 or .001 significance level, and greater interest in 
likelihood ratios when meaningful ones can be computed. 

Although the good showing of “acceptance support” in sensitive 
experiments is in its favor, it should still be noted that, seemingly, 
grounds are seldom available for deciding if experiments are going to be 
sensitive. In the face of this inevitable equivocality, investigators are 
encouraged to identify their theories with the alternative and so put 
an upper limit on error of 5% rather than 95%. One is not justified, 
after all, in assuming that favorable circumstances (true theories and 
sensitive experiments) will occur generally. A sensible strategy must 
assume the possibility of unfavorable circumstances (false theories 
and insensitive experiments) and must provide protection against 
these unfavorable circumstances. 

Intuitively, it might seem impossible to base any inference, or express 
any bias, on the basis of a completely insensitive experiment. This 
consideration, however, is just the point of the objection to identification 
with the null. Identification with the null allows one to base positive 
claims for theoretical confirmation on the acceptance of the null 
hypothesis, which is, of course, virtually assured (p = .95) in the 
completely insensitive experiment. 

Yet another consideration relates to the fact that it makes relatively 
little difference which approach you use in precise experiments. 
With great precision, you cannot go too far wrong. As experiments 
become imprecise, however, the tactical choice makes an increasingly 
large difference. All the more reason, therefore, for the conservative to 
choose the tactic that protects him even when experiments are 
imprecise. 

The second belief is that statistics can sometimes reveal facts 
worth knowing even if they are not apparent to the naked eye. The 
relationships between smoking and lung cancer and between obesity 
and heart disease are examples. It is true that statistics may lead one to 
be overly optimistic about the importance of effects that are more 
significant than important, and certainly this tendency is to be deplored. 
On the other hand, although the human observer, unbeguiled by 
statistics, may indeed discount many trivial effects, he may also infer 
strong effects when only a trivial effect or no effect at all is present. 
Many people are, in the opinion of these authors, overly optimistic 
about the existence of ESP and the efficacy of psychotherapy. This 
overoptimism exists, however, in spite of statistics, not because of them. 

Those who favor the so-called interocular test should realize that 
situations in which effects are large relative to error will yield their 
secrets quickly. This consideration suggests that investigators will 
inevitably spend most of their time on ambiguous situations in which 
the effects of Interest are small relative to the precision of measurement 
so far achieved. Statistics will be needed in such situations. 

The third belief is that false positives are more damaging than false
negatives. In the present context, this statement means that it is
worse to view a false theory as already proved than to view a true
theory as not yet proved. This belief is widespread (see, e.g., Campbell,
1959) and will not be belabored here. Granted the belief that false
positives cause more trouble, identification of one’s theory with the
alternative is natural. The traditional .05 significance level [...]
experiments [...]. If an investigator identifies his theory with the null,
however, the natural conservatism inherent in the traditional use of
small probability levels works for, rather than against, hasty claims for
support. Indeed, if an investigator identifies his theories with the null
and if his theories are false, his number of false positives approaches
not 5%, but 95%, as his experiments become increasingly imprecise.
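
The contrast between 5% and 95% can be checked numerically. The following is a minimal sketch and not part of the original argument: it assumes a two-tailed z test on a sample mean with a known error standard deviation, and the names `prob_accept_null`, `effect`, `sigma`, and `n` are illustrative choices. It computes the probability of retaining the null when a real effect exists, which is the rate at which an investigator who identifies a false theory with the null would claim support.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def prob_accept_null(effect, sigma, n, alpha=0.05):
    """Probability that a two-tailed z test retains the null.

    Assumes, for illustration only, a sample mean of n observations with
    known error standard deviation sigma and true mean shift `effect`.
    """
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    shift = effect * sqrt(n) / sigma          # expected value of the z statistic
    return STD_NORMAL.cdf(crit - shift) - STD_NORMAL.cdf(-crit - shift)


if __name__ == "__main__":
    # The null is false here (effect = 0.5), so a theory identified with
    # the null is false, and every retained null is an erroneous claim.
    for sigma in (1, 2, 5, 20, 100):
        rate = prob_accept_null(effect=0.5, sigma=sigma, n=25)
        print(f"error sd {sigma:>3}: P(retain null) = {rate:.3f}")
```

Under these assumptions the probability of retaining the null rises toward .95, without exceeding it, as the error standard deviation grows. A theory identified with the alternative and tested against a true null is, by construction of the test, falsely supported no more than 5% of the time.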

The fourth belief is that many null hypotheses are worth testing. 
Edwards questioned this belief by suggesting that most null hypotheses 
are obviously false and that the testing of them is, therefore, meaningless.
One of several possible replies is that the real question is frequently not 
whether the data deviate from the null, but whether the deviation is 
positive or negative. In such a case the testing of the null is obviously 
meaningful. 
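
The directional reading of a two-tailed test described above can be sketched as a three-way decision. This is an illustration under stated assumptions rather than a procedure given in the text: it assumes the test statistic is a z score, and the function name `directional_decision` and its labels are hypothetical.

```python
from statistics import NormalDist


def directional_decision(z, alpha=0.05):
    """Classify a z statistic as 'positive', 'negative', or 'undecided'.

    A two-tailed test at level alpha splits alpha evenly between the
    two tails, so each directional claim is made only beyond the
    critical value on its own side.
    """
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    if z > crit:
        return "positive"
    if z < -crit:
        return "negative"
    return "undecided"


if __name__ == "__main__":
    for z in (-2.5, 0.3, 2.5):
        print(z, directional_decision(z))
```

The "undecided" outcome corresponds to retaining the null: the data do not yet show whether the deviation is positive or negative.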

If the conservative investigator believes that false positives are the 
greater threat and that his experiment will be insensitive, he will 
surely choose to identify his theory with the null hypothesis. On the 
other hand, none of the beliefs discussed so far necessarily justifies a 
preference for probability levels over likelihood ratios, and likelihood 
ratios are, in fact, strongly recommended whenever it is possible to 
compute them. The last belief, however, is that in most psychological 
experiments it is not possible to use a Bayesian approach based on
likelihood ratios. The calculation of a ratio requires at least one
specifiable alternative distribution. In most psychology experiments,
no grounds are available for arriving at such a distribution. In such
cases, classical statistics appears to be the only alternative, and, under
this condition, the possibility of classical statistics being biased relative
to Bayesian statistics is a meaningless issue.

Naturally, classical statisticians as well as Bayesian statisticians
should consider alternative distributions. The point is that such
considerations may or may not yield enough information to
justify a Bayesian analysis. So long as such information is often, if
not usually, lacking, Edwards’ rejection of classical approaches seems
[...].

Edwards implied, to be sure, that investigators using classical
tests have often rejected true null hypotheses without real evidence
even when Bayesian statistics were potentially available. Bayesians
could perhaps make a definite contribution by analyzing a number of
published experiments and pointing out instances in which classical 
statistics has led to null hypothesis rejection when a potentially 
available Bayesian analysis would not have. It seems entirely possible, 
however, that such examples might be hard to find. At least they seem 
to be conspicuously absent from Bayesian critiques. It is also suggested 
that it would be most instructive if Bayesians would supply a precise
alternative distribution to accompany any of the following null
hypotheses, which are presented as typical psychological problems:
(a) partially reinforced subjects extinguish at the same rate as
continuously reinforced subjects; (b) patients show no change on tests of
adjustment as a function of counseling; (c) punishment does not
influence response rate; and (d) students working with teaching
machines learn no faster and no better than students reading ordinary
textbooks. Furthermore, the likelihood ratio, even if available, is not 
likely to disagree with the classical test unless the alternative distribu- 
tion is very broad. If the meaningful alternative is not uniform, as it 
was in the statistic acid example, but instead has a mode or modes 
somewhere in the neighborhood of the null value, the likelihood ratio 
and significance test are even more likely to point in the same direction. 
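
The claim about broad versus peaked alternatives can be illustrated with a small calculation. The sketch below rests on assumptions that are not in the original: the data are summarized by a z statistic, and the alternative is modeled as a normal distribution of standardized effects centered on the null value with spread `tau` (so its mode lies at the null). Under that assumption the z statistic has variance 1 under the null and 1 + tau squared under the alternative. The function name `likelihood_ratio` is illustrative.

```python
from statistics import NormalDist


def likelihood_ratio(z, tau):
    """Likelihood of z under the null divided by its likelihood under a
    normal alternative centered on the null with spread tau.

    Values below 1 favor the alternative; values above 1 favor the null.
    """
    null = NormalDist(0, 1)
    alternative = NormalDist(0, (1 + tau ** 2) ** 0.5)
    return null.pdf(z) / alternative.pdf(z)


if __name__ == "__main__":
    z = 1.96  # just significant at the two-tailed .05 level
    for tau in (0.5, 1, 2, 5, 10, 50):
        print(f"tau = {tau:>4}: ratio = {likelihood_ratio(z, tau):.2f}")
```

In this sketch a result that is just significant at the .05 level yields a ratio below 1 for modest spreads, in agreement with the significance test, and the ratio turns in favor of the null only when the alternative is made very broad (a spread of roughly 10 or more in these units).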
Another tactical point that merits attention is the question of 
whether a null hypothesis rejection resulting from a sensitive experi- 
ment should always be viewed as a legitimate success. As Table 1 
indicates, if a null hypothesis is false, a sufficiently sensitive experiment
will always reject it, and theories identified with the alternative will
always be supported. Table 1 views such successes as legitimate.
The possibility always exists, however, that the difference, though real,
may be so slight that recording it in the literature is a complete waste
of time. The present writers think that the indiscriminate cataloguing
of trivial effects is, in fact, a major problem in psychology today, and
they would certainly regret it if their position was in any way interpreted
as encouraging this unfortunate tendency. On the other hand, as
Wilson and Miller have pointed out, one must weigh one problem
against another. If investigators identify their theories with the
alternative hypothesis, they may be tempted to run many subjects and report
theories to be true, even though the theories have no predictive utility.
On the other hand, if investigators identify their theories with the null
hypothesis, they may be tempted to run few subjects and report theories
to be true, even though they are completely false. The present writers
find no difficulty in deciding that they would prefer to confront
investigators with the first temptation rather than the second.
Recognition of the problem of accepting trivial effects does not,
therefore, modify this paper’s inclination towards identification
with the alternative hypothesis in combination with the use of classical
statistics. This strategy may not be the best imaginable, but it is still often
the best available, and it is recommended to the scientific conservative.
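
The first temptation, running many subjects until a trivial effect becomes significant, can be put in rough numbers. The following is a hedged sketch, not a calculation from the original: it assumes a two-tailed one-sample z test, expresses the true effect in units of the error standard deviation, and uses the standard normal approximation for sample size. The name `subjects_needed` and the default power of .80 are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def subjects_needed(effect, power=0.80, alpha=0.05):
    """Approximate n for a two-tailed one-sample z test to reach `power`.

    `effect` is the true mean shift in error-standard-deviation units.
    Uses the usual normal approximation, which ignores the negligible
    chance of rejecting in the wrong tail.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return ceil(((z_alpha + z_power) / effect) ** 2)


if __name__ == "__main__":
    for effect in (0.5, 0.2, 0.05, 0.02):
        print(f"effect {effect}: about {subjects_needed(effect)} subjects")
```

Because the required number of subjects grows with the inverse square of the effect, any nonzero effect can be made significant given enough subjects, which is why significance alone says nothing about predictive utility.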
In summing up this section, we wish to remind the reader that
Edwards apparently recommended that investigators switch to
Bayesian techniques altogether. He went on to say, however, that if one
must use classical statistics, the more conservative tactic, that is, the
tactic that minimizes erroneous claims for theoretical support, is
identification with the null. Our perspective on this advice can
be stated very briefly: If a conservative is a person who wishes to
abandon a tactic that puts a ceiling on error of 1 in 20 and adopt instead
a tactic that puts a ceiling on error of 19 in 20, then this is excellent
advice.


Does the development of 
multiple models avoid the 
need for null hypothesis 
testing? 


Edwards’ last point was that the better tactic is to compare your data to
several plausible models, and hence avoid null hypothesis testing
altogether. Multiple models seem thoroughly desirable, and it seems
worthwhile to note that in a traditional two-tailed test, classical
statistics always implies three families of models: one predicting no
difference, one predicting a positive difference, and one predicting a
negative difference. On the other hand, it is hard to see how multiple
models avoid the need for null hypothesis testing. Again an example
from Edwards’ (1965) paper may be helpful.


A man from Mars, asked whether or not your suit fits you, would have
trouble answering. He could notice the discrepancies between its
measurements and yours, and might answer no; he could notice that you
did not trip over it, and might answer yes. But give him two suits and ask
him which fits you better, and his task starts to make sense [p. 402].

But ask him if he’s sure of his decision or if he might reverse it if he saw
you model the suits again, and you have a null hypothesis to test. 
In other words, although on one occasion one model (suit) was judged 
to fit the data (you) better, one must still ask if the difference in fit is 
significant relative to the potential sources of error. However stated 
and however tested, this question still seems to constitute a null 
hypothesis. Bayesian statistics may offer a meaningful alternative to 
null hypothesis testing, but it will take more than this example to
convince the present authors.


Are undefined alternative
hypotheses the fault of
classical statistics?


From the current vantage point, the classical bias against the null
hypothesis does not appear as obvious as Edwards’ bias against
classical statistics. It is instructive to note one aspect of the rationale
behind Edwards’ (1965) bias: “The trouble is that in classical statistics
the alternative hypothesis is essentially undefined, and so provides
no standard by means of which to judge the congruence between
datum and null hypothesis [p. 402].” Edwards thus treats the
absence of a specifiable alternative as a shortcoming of classical
statistics. It seems appropriate to point out that, in fact, the absence
of a well-defined alternative is not a vice of classical statistics. The
absence of a well-defined alternative is a problem, a problem which is
unavoidable in many cases, a problem which Bayesian statistics
presumably cannot handle, and also a problem which classical statistics
is especially designed to avoid.


Summary 


Some of the main points of the position of the present writers may be
summed up as follows: (a) Identification with the alternative limits
erroneous claims for theoretical support to 5%; identification with the
null limits such errors to 95%. The conservative is, therefore, presumably
better advised to identify with the alternative. (b) Classical
procedures assume that only the null distribution can be specified, and
ask if the data are from this distribution or some other. Bayesian
procedures assume that two distributions can be specified, and ask if
the data are from the null distribution or the alternative. Granted a
difference in procedure and in the information assumed, discussion of
bias in one procedure relative to the other seems of dubious
meaningfulness. (c) Granted the information assumed by classical procedures,
they show an absolute bias in favor of the null hypothesis.

In selecting the material for this book, the primary goal is to
facilitate the training of students to become better researchers. It is
the authors’ belief that this goal can be implemented most
effectively by providing essential information in certain key areas of
experimentation. Therefore, the readings which have been selected
are intended primarily to provide the student with a clearer
description and understanding of some uncontrolled factors which
can affect the outcome of an experiment independently of the
variables that are deliberately manipulated.

Organization of the articles follows no rigid pattern. The student can
begin with any block of readings he finds most appealing.